KR102657104B1

KR102657104B1 - Operation device of convolutional neural network, operation method of convolutional neural network and computer program stored in a recording medium to execute the method thereof

Info

Publication number: KR102657104B1
Application number: KR1020220063051A
Authority: KR
Inventors: 송진호; 노원우; 김현진; 안성우; 오윤호; 김보길
Original assignee: 연세대학교 산학협력단
Priority date: 2021-05-24
Filing date: 2022-05-23
Publication date: 2024-04-15
Also published as: KR20220158639A

Abstract

According to one aspect of the disclosed invention, an element ID is assigned to each component of the input data matrix, and a single register can be allocated to mutually redundant components on the basis of these element IDs. This makes it possible to provide a convolution operation method that is faster and more energy efficient than a conventional convolution operation method.
A convolution operation method according to an embodiment generates a feature data matrix corresponding to an output data matrix by performing a general matrix multiplication (GEMM) operation on an input data matrix and a set filter matrix. The method may include: updating, by at least one processor, a register mapping table so that first destination register addresses of redundant components, which represent mutually duplicated data among a plurality of components of the input data matrix, correspond to the same second destination register address; and performing, by the at least one processor, a convolution operation by reusing, on the basis of the register mapping table, the register having the same second destination register address for the redundant components.

Description

Convolution operation device, convolution operation method, and computer program stored in a recording medium to execute the method

The present invention relates to a convolution operation device and a convolution operation method that can improve memory efficiency by eliminating unnecessary memory accesses to redundant data arising during the convolution operation of an artificial neural network and by reusing the redundant data stored in a register file.

Convolution is one of the core operations of deep artificial neural networks and is widely used across computing fields such as object detection, semantic segmentation, and image generation. It can be broadly applied to artificial intelligence applications such as autonomous driving and VR/AR.

Because deep artificial neural networks involve large amounts of data and computation, general-purpose graphics processing units are used as acceleration hardware. The convolution operation is computationally heavy enough to account for most of the total processing time of a deep artificial neural network.

Meanwhile, the matrix operation included in the convolution operation replicates a large amount of data within a workspace. This increases memory usage and the number of memory accesses, which can slow down the operation and lower energy efficiency.

According to one aspect of the disclosed invention, an element ID is assigned to each component of the input data matrix, and a single register can be allocated to mutually redundant components on the basis of these element IDs. This makes it possible to provide a convolution operation method that is faster and more energy efficient than a conventional convolution operation method.

A convolution operation method according to one aspect of the disclosed invention generates a feature data matrix corresponding to an output data matrix by performing a general matrix multiplication (GEMM) operation on an input data matrix and a set filter matrix, and includes: updating, by at least one processor, a register mapping table so that first destination register addresses of redundant components, which represent mutually duplicated data among a plurality of components of the input data matrix, correspond to the same second destination register address; and performing, by the at least one processor, a convolution operation by reusing, on the basis of the register mapping table, the register having the same second destination register address for the redundant components.

The updating of the register mapping table may include: generating identifiers of the plurality of components; and updating the register mapping table so that the first destination register addresses of those components for which the same identifier is generated correspond to the same second destination register address.

The identifier may include an element ID, and the generating of the identifier may include: generating patch IDs of the plurality of components on the basis of the array index of the plurality of components, the number of rows and the number of columns of the filter matrix, and the number of columns of the output data matrix; and generating element IDs of the plurality of components on the basis of the patch IDs and offsets of the plurality of components. The offset is a value determined on the basis of the patch ID, the number of columns of the input data matrix, and the number of channels, and the array index may be a value indicating the position of each component when the plurality of components are arranged in a one-dimensional array.

The identifier may further include a batch ID, and the generating of the identifier may further include generating batch IDs of the plurality of components on the basis of the array index of the plurality of components, the number of rows and the number of columns of the filter matrix, and the number of rows and the number of columns of the output data matrix.

The generating of the patch ID may include: calculating a first patch element and a second patch element of the plurality of components; and generating the patch ID by adding the second patch element to the product of the first patch element and the stride of the filter matrix. The first patch element is the quotient obtained by dividing the row element of a component by the number of columns of the output data matrix, and the second patch element is the quotient obtained by dividing the column element of the component by the number of columns of the filter matrix. The row element may be the quotient, and the column element the remainder, obtained by dividing the array index by the size of the filter matrix.

The generating of the element ID may generate the element ID by adding, to the offset, the remainder obtained by dividing the row element by the product of the number of columns of the output data matrix, the number of channels, and the stride, and the remainder obtained by dividing the column element by the product of the number of columns of the filter matrix and the number of channels. The offset of the input data matrix may be the product of the patch ID, the number of columns of the input data matrix, and the number of channels.
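
The patch ID and element ID arithmetic described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `patch_and_element_id`, the constants `CHANNELS` and `STRIDE`, and the 4x4 input with a 3x3 filter are assumptions made for the example, and only the single-channel, unit-stride case is exercised. Under those assumptions the element ID coincides with the flat index of the source component in the original input, so workspace entries that copy the same input component receive the same ID.

```python
CHANNELS = 1  # assumption: single-channel input
STRIDE = 1    # assumption: unit stride


def patch_and_element_id(index, fh, fw, out_w, in_w):
    """Return (patch ID, element ID) for one workspace array index.

    fh, fw : rows and columns of the filter matrix
    out_w  : columns of the output data matrix
    in_w   : columns of the original input data matrix
    """
    filter_size = fh * fw
    # Row element = quotient, column element = remainder of index / filter size.
    row_elem, col_elem = divmod(index, filter_size)
    first_patch = row_elem // out_w
    second_patch = col_elem // fw
    patch_id = first_patch * STRIDE + second_patch
    offset = patch_id * in_w * CHANNELS
    elem_id = (offset
               + row_elem % (out_w * CHANNELS * STRIDE)
               + col_elem % (fw * CHANNELS))
    return patch_id, elem_id


# Hypothetical 4x4 input, 3x3 filter: 2x2 output, so a 4x9 workspace (36 entries).
ids = [patch_and_element_id(i, fh=3, fw=3, out_w=2, in_w=4) for i in range(36)]

# Flat source index of every workspace entry, computed directly for comparison.
reference = [(oy + fy) * 4 + (ox + fx)
             for oy in range(2) for ox in range(2)
             for fy in range(3) for fx in range(3)]
```

In this example array indices 1 and 9 (row 1, column 2 and row 2, column 1 of the workspace) both receive element ID 1, which is the property that allows them to share one register.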

The method may further include generating, by the at least one processor, the input data matrix by changing the size of an original input data matrix and the number and order of its components to fit a memory area corresponding to a workspace. The input data matrix may be a matrix in which the components of the original input data matrix are rearranged according to a rule so that the feature data matrix can be output by a GEMM operation between the original input data and the filter matrix.

The generating of the input data matrix may include generating the input data matrix by converting the original input data matrix into a workspace whose number of rows equals the number of rows of the feature data matrix and whose number of columns equals the size of the filter matrix.
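
The workspace conversion described above corresponds to what is commonly called an im2col transform. The following sketch, written in plain Python, shows the idea under stated assumptions: the helper names `im2col` and `gemm_conv`, the single-channel 4x4 input, and the 3x3 filter are all hypothetical and chosen only for illustration.

```python
def im2col(image, fh, fw, stride=1):
    """Unfold a 2-D image (list of rows) into a workspace matrix.

    Each workspace row holds the fh*fw input components covered by the
    filter at one output position, so the workspace has
    (out_h * out_w) rows and (fh * fw) columns.
    """
    in_h, in_w = len(image), len(image[0])
    out_h = (in_h - fh) // stride + 1
    out_w = (in_w - fw) // stride + 1
    workspace = []
    for oy in range(out_h):
        for ox in range(out_w):
            row = [image[oy * stride + fy][ox * stride + fx]
                   for fy in range(fh) for fx in range(fw)]
            workspace.append(row)
    return workspace


def gemm_conv(image, kernel, stride=1):
    """Convolution as workspace x flattened filter (a matrix-vector product)."""
    fh, fw = len(kernel), len(kernel[0])
    flat_kernel = [w for kernel_row in kernel for w in kernel_row]
    return [sum(a * b for a, b in zip(row, flat_kernel))
            for row in im2col(image, fh, fw, stride)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]

workspace = im2col(image, 3, 3)      # 4 rows (2x2 outputs) x 9 columns (3x3 filter)
features = gemm_conv(image, kernel)  # one-column feature data: [18, 21, 30, 33]
```

The workspace has as many rows as the feature data and as many columns as the filter has entries, matching the shape stated in the text.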

The updating of the register mapping table may include: receiving tensor core load data; identifying whether the identifier of the component having the first destination register address included in the tensor core load data, and a second destination register address, are recorded in a load history buffer; when the identifier and the second destination register address are not recorded in the load history buffer, accessing memory to fetch the data of the component from the memory hierarchy, recording in the load history buffer the identifier and the second destination register address of the register in which the fetched data is stored, and updating the register mapping table so that the second destination register address recorded in the load history buffer corresponds to the first destination register address; and when the identifier and the second destination register address are recorded in the load history buffer, updating the register mapping table, without accessing memory, so that the second destination register address recorded in the load history buffer corresponds to the first destination register address.
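
The hit and miss handling described above can be modelled in software as a rough sketch. The function `process_loads`, the dictionary-based buffer, and the sample data below are assumptions for illustration only; the sketch also assumes an unbounded load history buffer and register file with no eviction, which a hardware design would not have.

```python
def process_loads(loads, memory):
    """Simulate register-mapping updates for a sequence of loads.

    loads  : list of (first destination register address, component identifier)
    memory : dict mapping component identifier -> data value
    """
    history = {}         # load history buffer: identifier -> second (physical) address
    mapping = {}         # register mapping table: first address -> second address
    registers = []       # physical register file; the list index is the address
    memory_accesses = 0
    for first_addr, identifier in loads:
        if identifier not in history:
            # Miss: fetch from memory, store in a new register, record it.
            registers.append(memory[identifier])
            memory_accesses += 1
            history[identifier] = len(registers) - 1
        # Hit or miss: point the first address at the recorded second address.
        mapping[first_addr] = history[identifier]
    return mapping, registers, memory_accesses


# Hypothetical data: three distinct components, five loads, two of them redundant.
memory = {0: 10, 1: 20, 2: 30}
loads = [("r0", 0), ("r1", 1), ("r2", 1), ("r3", 2), ("r4", 0)]
mapping, registers, memory_accesses = process_loads(loads, memory)
```

In this run the five loads cause only three memory accesses, because `r2` and `r4` are remapped to registers that already hold the requested data.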

A computer program according to one aspect of the disclosed invention is stored in a computer-readable recording medium to execute the convolution operation method.

A convolution operation device according to one aspect of the disclosed invention includes: a memory storing a register mapping table; and a processor that performs a convolution operation of generating a feature data matrix corresponding to an output data matrix by performing a general matrix multiplication (GEMM) operation on an input data matrix and a set filter matrix. The processor updates the register mapping table so that first destination register addresses of redundant components, which represent mutually duplicated data among a plurality of components of the input data matrix, correspond to the same second destination register address, and performs the convolution operation by reusing, on the basis of the register mapping table, the register having the same second destination register address for the redundant components.

The processor may generate identifiers of the plurality of components and update the register mapping table so that the first destination register addresses of those components for which the same identifier is generated correspond to the same second destination register address.

The identifier may include an element ID, and the processor may generate patch IDs of the plurality of components on the basis of the array index of the plurality of components, the number of rows and the number of columns of the filter matrix, and the number of columns of the output data matrix, and may generate element IDs of the plurality of components on the basis of the patch IDs and offsets of the plurality of components. The offset is a value determined on the basis of the patch ID, the number of columns of the input data matrix, and the number of channels, and the array index may be a value indicating the position of each component when the plurality of components are arranged in a one-dimensional array.

The identifier may further include a batch ID, and the processor may generate batch IDs of the plurality of components on the basis of the array index of the plurality of components, the number of rows and the number of columns of the filter matrix, and the number of rows and the number of columns of the output data matrix.

The processor may calculate a first patch element and a second patch element of the plurality of components and generate the patch ID by adding the second patch element to the product of the first patch element and the stride of the filter matrix. The first patch element is the quotient obtained by dividing the row element of a component by the number of columns of the output data matrix, and the second patch element is the quotient obtained by dividing the column element of the component by the number of columns of the filter matrix. The row element may be the quotient, and the column element the remainder, obtained by dividing the array index by the size of the filter matrix.

The processor may generate the element ID by adding, to the offset, the remainder obtained by dividing the row element by the product of the number of columns of the output data matrix, the number of channels, and the stride, and the remainder obtained by dividing the column element by the product of the number of columns of the filter matrix and the number of channels. The offset of the input data matrix may be the product of the patch ID, the number of columns of the input data matrix, and the number of channels.

The processor may generate the input data matrix by changing the size of an original input data matrix and the number and order of its components to fit a memory area corresponding to a workspace. The input data matrix may be a matrix in which the components of the original input data matrix are rearranged according to a rule so that the feature data matrix can be output by a GEMM operation between the original input data and the filter matrix.

The processor may generate the input data matrix by converting the original input data matrix into a workspace whose number of rows equals the number of rows of the feature data matrix and whose number of columns equals the size of the filter matrix.

The processor may receive tensor core load data and identify whether the identifier of the component having the first destination register address included in the tensor core load data, and a second destination register address, are recorded in a load history buffer. When the identifier and the second destination register address are not recorded in the load history buffer, the processor may fetch the data of the component, record in the load history buffer the identifier and the second destination register address of the register in which the fetched data is stored, and update the register mapping table so that the second destination register address recorded in the load history buffer corresponds to the first destination register address. When the identifier and the second destination register address are recorded in the load history buffer, the processor may update the register mapping table so that the second destination register address recorded in the load history buffer corresponds to the first destination register address.

According to one aspect of the disclosed invention, the original input data matrix is converted into a workspace, a common destination register is allocated to the mutually redundant components among the components of the converted input data matrix, and the data of the input data matrix stored in the common destination register is reused, so that an operation that is faster and more energy efficient than a conventional convolution operation can be performed.

FIG. 1 is a configuration diagram of a convolution operation device according to an embodiment.
FIG. 2 is another configuration diagram of a convolution operation device according to an embodiment.
FIG. 3 is a table for explaining the use of a load history buffer in an embodiment.
FIG. 4 is a diagram showing the configuration of an ID generator and a load history buffer according to an embodiment.
FIG. 5 is a diagram for explaining how it is identified whether a second destination register address is recorded in the load history buffer according to an embodiment.
FIG. 6 is a diagram for explaining a convolution operation between an original input data matrix and a filter matrix according to an embodiment.
FIG. 7 is a diagram showing an input data matrix converted into a workspace according to an embodiment.
FIG. 8 is a diagram showing that the input data matrix according to an embodiment is a matrix whose components are rearranged according to a rule.
FIG. 9 is another diagram showing that the input data matrix according to an embodiment is a matrix whose components are arranged according to a rule.
FIG. 10 is a diagram for explaining the patch IDs of a plurality of components of an input data matrix generated according to an embodiment.
FIG. 11 is a diagram for explaining the element IDs of a plurality of components of the input data matrix 202 generated according to an embodiment.
FIG. 12 is a diagram showing the configuration of an ID generator and a load history buffer according to another embodiment.
FIG. 13 is a flowchart of a convolution operation method according to an embodiment.
FIG. 14 is a graph showing the effect of the convolution operation method of the present invention.

Like reference numerals refer to like elements throughout the specification. This specification does not describe every element of the embodiments, and content that is general in the technical field of the disclosed invention or that overlaps between embodiments is omitted. The term '~module' used in the specification may be implemented in software or hardware; depending on the embodiment, a plurality of '~modules' may be implemented as a single component, or a single '~module' may include a plurality of components.

When a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Terms such as first and second are used to distinguish one component from another, and the components are not limited by these terms.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

Identification codes for the steps are used for convenience of explanation and do not describe the order of the steps; the steps may be performed in an order different from the stated one unless a specific order is clearly stated in context.

Hereinafter, the operating principle and embodiments of the disclosed invention are described with reference to the accompanying drawings.

FIG. 1 is a configuration diagram of a convolution operation device according to an embodiment, and FIG. 2 is another configuration diagram of the convolution operation device according to an embodiment.

Referring to FIG. 1, a convolution operation device 100 according to an embodiment of the present invention may include a processor 110, a memory 120, an ID generator 130, a load history buffer 140, and a register 150.

The processor 110 may perform a convolution operation. Specifically, to perform the convolution operation, the processor 110 may generate a feature data matrix corresponding to an output data matrix by performing a general matrix multiplication (GEMM) operation on an input data matrix and a set filter matrix.

Here, the input data matrix may be a matrix obtained by converting an original input data matrix into a workspace. Specifically, the input data matrix may be a matrix in which the components of the original input data matrix are rearranged according to a rule so that the feature data matrix can be output by a GEMM operation between the original input data and the filter matrix. The feature data matrix may be a matrix corresponding to the output data matrix generated by the convolution operation. Specifically, the feature data matrix may be the result of a GEMM operation that uses the set filter matrix and takes the input data matrix as its input. The feature data matrix may also be a matrix with a single column.

The memory 120 may store a register mapping table that is set on the basis of the data rule of the plurality of components of the input data matrix.

The processor 110 may update the register mapping table stored in the memory 120.

Specifically, the processor 110 may update the register mapping table so that first destination register addresses of redundant components, which represent mutually duplicated data among the plurality of components of the input data matrix, correspond to the same second destination register address.

Here, redundant components are components of the input data matrix that duplicate one another.

Specifically, when the original input data matrix is converted into the workspace, the matrix grows in size, and a given component of the original input data matrix may be arranged repeatedly at different positions. Components of the input data matrix that were converted from the same component of the original input data matrix are thus redundant components of one another.
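
To give a feel for how much duplication the workspace conversion introduces, the following sketch counts how often each original component is copied. The helper `workspace_sources` and the 4x4 input with a 3x3 filter are hypothetical and serve only as an illustration.

```python
from collections import Counter


def workspace_sources(in_h, in_w, fh, fw, stride=1):
    """List, in row-major workspace order, the (row, col) of the original
    input component that each workspace entry copies."""
    out_h = (in_h - fh) // stride + 1
    out_w = (in_w - fw) // stride + 1
    return [(oy * stride + fy, ox * stride + fx)
            for oy in range(out_h) for ox in range(out_w)
            for fy in range(fh) for fx in range(fw)]


sources = workspace_sources(4, 4, 3, 3)
copies = Counter(sources)

total_entries = len(sources)   # workspace size: 4 rows x 9 columns
unique_entries = len(copies)   # distinct original components referenced
```

For this small case the workspace holds 36 entries but only 16 distinct components, so more than half of the workspace loads are redundant; a central component such as `(1, 1)` appears four times.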

The processor 110 may then be configured to perform the convolution operation between the input data matrix and the filter matrix by reusing, on the basis of the register mapping table, the register 150 having the same second destination register address for the redundant components.

That is, because the components of the input data matrix include redundant components, if the data currently needed for the GEMM operation has already been loaded from the memory 120, there may be a register 150 in which that data is stored.

In this case, the processor 110 can read the data from the register 150 in which it is stored and perform the GEMM operation without accessing the memory 120 again, which prevents unnecessary accesses to the memory 120.

도2를 참조하면, 합성곱 연산 장치(100)는 레지스터 매핑 테이블(121)을 포함할 수 있다.Referring to Figure 2, the convolution operation device 100 may include a register mapping table 121.

레지스터 매핑 테이블(121)은 복수의 제1 목적 레지스터 주소들과, 복수의 제2 목적 레지스터 주소들 간의 대응 관계를 나타내는 테이블일 수 있다.The register mapping table 121 may be a table representing the correspondence between a plurality of first destination register addresses and a plurality of second destination register addresses.

제1 목적 레지스터 주소는 입력 데이터 행렬의 복수의 성분들 각각에 대응하는 논리적 주소(logical address) 또는 가상 주소(virtual address)를 의미할 수 있다.The first target register address may mean a logical address or a virtual address corresponding to each of a plurality of components of the input data matrix.

The second destination register address may refer to a physical address corresponding to each of the plurality of registers 150.

Meanwhile, the second destination register address may be set to the same value for a given component and for every component that is redundant with it.

Specifically, since a first destination register address corresponds to each of the plurality of components of the input data matrix, it may differ from component to component. For example, the first destination register address of the component in row 1, column 2 of the input data matrix may differ from the first destination register address of the component in row 2, column 1 of the input data matrix.

In contrast, the second destination register address may be set to the same value for redundant components. That is, if the component in row 1, column 2 of the input data matrix and the component in row 2, column 1 of the input data matrix are redundant with each other, the second destination register address of the component in row 1, column 2 and the second destination register address of the component in row 2, column 1 may be the same.

Each first destination register address may have a corresponding second destination register address. That is, the data of the component corresponding to a first destination register address is, or can be, stored in the register 150 having the second destination register address corresponding to that first destination register address, and this correspondence between first and second destination register addresses may be recorded in the register mapping table 121.

Based on the register mapping table 121, the processor 110 may perform the convolution operation by reusing the register 150 having the same second destination register address for the redundant components.

Specifically, the processor 110 may translate a first destination register address into the corresponding second destination register address based on the register mapping table 121.

When the processor 110 performs an operation that uses the data of the component at a particular first destination register address among the plurality of components of the input data matrix, it may obtain that data from the register 150 whose second destination register address is the translation of that first destination register address.
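The address translation described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the dictionary layout, the function name `read_operand`, the register names, and the stored values are all hypothetical and are not taken from the disclosure.

```python
# Hypothetical model of the register mapping table 121: a dict from first
# (logical) destination register addresses to second (physical) ones.
register_mapping_table = {"r4": "p2", "r8": "p6", "r3": "p2"}

# Hypothetical physical register file (registers 150); values are made up.
register_file = {"p2": 0.5, "p6": -1.25}


def read_operand(logical_addr):
    """Translate a logical address, then read the physical register it maps to."""
    physical_addr = register_mapping_table[logical_addr]
    return register_file[physical_addr]
```

In this sketch, `r4` and `r3` are distinct logical addresses that resolve to the same physical register, so reading either one returns the same stored value without a second copy of the data.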

The ID generator 130 may generate identifiers for the plurality of components of the input data matrix. One identifier may be generated for each of the plurality of components. In doing so, the ID generator 130 may generate the same identifier for redundant components among the plurality of components of the input data matrix.

Specifically, the ID generator 130 generates the same identifier for redundant components based on the memory address of the load instruction and the parameters of the convolution operation (such as the size, width, and number of channels of the input data matrix, and the size, width, and stride of the filter), so that redundancy among the plurality of components of the input data matrix can be determined. To this end, the ID generator 130 may be programmed with these parameters when the convolution operation starts.

The processor 110 may update the register mapping table 121 based on the identifiers. That is, the processor 110 may update the register mapping table 121 so that the first destination register addresses of components having the same identifier, among the plurality of components of the input data matrix, correspond to the same second destination register address.

The load history buffer 140 may store the identifiers of previously loaded components together with the second destination register addresses of the registers in which those components are stored.

When a load instruction for a particular component of the input data matrix is received, the processor 110 may use the identifier generated by the ID generator 130 to check, through the load history buffer 140, whether the earlier records include a component that had the same identifier.

If the load history buffer 140 is found to contain a record of a component with the same identifier, the processor 110 reuses the value stored in the register 150 having the second destination register address reported by the load history buffer 140 instead of reading the data of that component from the memory 120, thereby eliminating unnecessary memory accesses that repeatedly read redundant data.

In this way, the convolution operation method of the present invention efficiently detects and exploits redundant data located at different memory addresses in a deep neural network that uses convolution operations, and thereby greatly increases the number of times data is reused. As a result, the proposed technique can improve the performance of general-purpose graphics processing units by reusing a large amount of data and eliminating unnecessary memory accesses.

The ID generator 130 may include any one of the plurality of processors 110 included in the convolution operation device 100. In addition, the convolution operation method according to the embodiments described so far and those described below may be implemented in the form of a program executable by the processor 110 and the ID generator 130.

Here, the program may include program instructions, data files, data structures, and the like, alone or in combination. The program may be designed and written in machine code or in a high-level language. The program may be specially designed to implement the convolution operation method described above, or it may be implemented using various functions or definitions that are already known and available to those skilled in the field of computer software. The program for implementing the convolution operation method may be recorded on a recording medium readable by the processor 110 and the ID generator 130. The recording medium may be the memory 120.

The memory 120 may store a program that performs the operations described above and below, and the processor 110 may execute the stored program. When there are a plurality of processors 110 and memories 120, they may be integrated on a single chip or provided at physically separate locations. The memory 120 may include volatile memory for temporarily storing data, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The memory 120 may also include non-volatile memory for long-term storage of control programs and control data, such as read-only memory (ROM), erasable programmable read-only memory (EPROM), or electrically erasable programmable read-only memory (EEPROM).

The processor 110 and the ID generator 130 may include various logic circuits and arithmetic circuits, may process data according to the program provided from the memory 120, and may generate control signals according to the processing results.

FIG. 3 is a table illustrating the use of the load history buffer in one embodiment.

Conventional techniques exploit data locality through a cache, but in a deep neural network the data of identical redundant components reside at different memory addresses, so an ordinary cache cannot exploit this locality at all. In contrast, the convolution operation method of the present invention takes various parameters into account and assigns the data of identical redundant components to the same core, thereby maximizing the effect of detecting and eliminating redundant data.

Referring to FIG. 3, the tensor core load data 500 may include a first destination register address 501.

The processor 110 may receive the tensor core load data 500. Specifically, the processor 110 may receive the tensor core load data 500 for the convolution operation currently in progress.

The tensor core load data 500 may be data included in a tensor core load instruction. Based on the tensor core load instruction, the processor 110 may load redundant data that was already used in a previous operation and is stored in the register 150.

Each piece of tensor core load data 500 may include one first destination register address 501. Specifically, the tensor core load instruction may include an instruction to load the data of the component at the first destination register address 501 in order to perform the convolution operation.

That is, when a tensor core load instruction is input, the data of the particular component of the input data matrix corresponding to the first destination register address included in that instruction may need to be loaded.

Meanwhile, each piece of tensor core load data 500 may correspond to one array index 700. The array index 700 may be a parameter describing the position, within the matrix, of the component to be loaded.

Specifically, the array index 700 may be a value indicating the position of each component when the plurality of components of the input data matrix are arranged as a one-dimensional array.

Consequently, the array index 700 may be information about the position, in the input data matrix, of the component required by the tensor core load instruction. In addition, each array index 700 may correspond to one first destination register address.

The processor 110 may generate identifiers for the plurality of components of the input data matrix.

The processor 110 may then update the register mapping table 121 so that the first destination register addresses 501 of components for which the same identifier was generated, among the plurality of components of the input data matrix, correspond to the same second destination register address 142.

Here, the identifier according to an embodiment of the present disclosure may include an element ID 141, as shown in FIG. 3.

The element ID 141 of the input data matrix component required by the operation of a tensor core load instruction may be generated based on the position of that component in the matrix, that is, on the array index 700.

For example, referring to FIG. 3, the processor 110 may generate '2' as the element ID 141 of the component corresponding to the first destination register address 501 'r4' included in the received tensor core load data 500.

Likewise, the processor 110 may generate '10' as the element ID 141 of the component corresponding to the first destination register address 501 'r3' included in the received tensor core load data 500.

Likewise, the processor 110 may generate '6' as the element ID 141 of the component corresponding to the first destination register address 501 'r8' included in the received tensor core load data 500.

The processor 110 may then determine the second destination register address 142 based on the element ID 141 of the input data matrix component. Specifically, the processor 110 may determine the second destination register addresses 142 so that components having the same element ID 141 have the same second destination register address 142.

The processor 110 may then update the register mapping table 121 so that the determined second destination register address 142 corresponds to the first destination register addresses 501 of those components.

For example, referring to FIG. 3, if the element ID 141 corresponding to the first destination register address 501 of the received tensor core load data 500 is '2' and the corresponding second destination register address 142 is determined to be 'p2', the processor 110 may update the register mapping table 121 so that the second destination register address 142 'p2' corresponds to the first destination register address 501 'r4'.

Similarly, if the element ID 141 corresponding to the first destination register address 501 of the received tensor core load data 500 is '6' and the corresponding second destination register address 142 is determined to be 'p6', the processor 110 may update the register mapping table 121 so that the second destination register address 142 'p6' corresponds to the first destination register address 501 'r8'.

An earlier tensor core load instruction may have had the same element ID 141 as the element ID 141 of the input data matrix component used by the tensor core load instruction currently being received.

In this case, the processor 110 may update the register mapping table 121 so that the second destination register address 142 corresponding to the first destination register address 501 of the component in the current tensor core load instruction is the same as the second destination register address 142 corresponding to the first destination register address 501 of the component in the earlier tensor core load instruction.

Accordingly, the processor 110 can perform the operation of the current tensor core load instruction using the same register that was used by the earlier tensor core load instruction.

For example, referring to FIG. 3, the third tensor core load instruction received has the element ID 141 '2', which is the same as the element ID 141 '2' of the first tensor core load instruction received. The processor 110 may therefore update the register mapping table 121 so that the second destination register address 142 corresponding to the first destination register address 'r3' included in the third tensor core load instruction is 'p2'.

Accordingly, to load the data used by the operation of the third tensor core load instruction, the processor 110 does not need to access the memory 120; it can perform the matrix operation using the data stored in the register 150 whose second destination register address 142 is 'p2', which was used for the operation of the first tensor core load instruction.
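The FIG. 3 walk-through above can be replayed with a short Python sketch. This is only a sketch under assumptions: the function `process_loads`, the tuple layout of a load record, and the register name 'p7' (standing for whatever physical register the third load would otherwise have received) are hypothetical and do not come from the disclosure.

```python
def process_loads(loads):
    """Replay (logical_addr, element_id, allocated_physical) load records.

    Returns the resulting register mapping table and load history buffer.
    """
    load_history = {}    # element ID -> physical register holding its data
    mapping_table = {}   # logical address -> physical address
    for logical, element_id, allocated in loads:
        if element_id not in load_history:
            # First load of this element: its data lands in `allocated`.
            load_history[element_id] = allocated
        # Later loads with the same ID are redirected to the recorded register.
        mapping_table[logical] = load_history[element_id]
    return mapping_table, load_history


# First, second, and third loads from the FIG. 3 walk-through; 'p7' is made up.
loads = [("r4", 2, "p2"), ("r8", 6, "p6"), ("r3", 2, "p7")]
table, history = process_loads(loads)
```

With these inputs, `table` maps both 'r4' and 'r3' to 'p2' and 'r8' to 'p6', which matches the mapping described for the first, second, and third tensor core load instructions.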

FIG. 4 is a diagram showing the configuration of the ID generator and the load history buffer according to an embodiment, and FIG. 5 is a diagram for explaining how it is identified whether a second destination register address is recorded in the load history buffer according to an embodiment.

Referring to FIG. 4, the convolution operation device 100 may include, in a general-purpose graphics processing unit, a detection unit capable of checking data redundancy. The processor 110 may include this detection unit.

The detection unit may check whether the processor 110 has previously accessed the same memory data, and may record where in the register file that data is stored. The detection unit may consist of an ID generator 130, which checks data redundancy, and a load history buffer 140, which tracks the history of memory load instructions.

The ID generator 130 may generate the element ID 141 of the input data matrix component required for the current matrix operation. The generated element ID 141 and the corresponding second destination register address 142 may then be recorded in the load history buffer 140.

Referring to FIG. 5, the detection unit, that is, the processor 110, may identify whether the element ID 141 of the input data matrix component required for the current matrix operation and the corresponding second destination register address 142 are recorded in the load history buffer 140.

If the element ID 141 of the input data matrix component required for the current matrix operation and the corresponding second destination register address 142 are recorded in the load history buffer 140, the processor 110 may update the register mapping table 121 so that this second destination register address 142 corresponds to the first destination register address 501 of that component.

Accordingly, to load the data of the input data matrix component required for the current matrix operation, the processor 110 can, based on the register mapping table 121, reuse the data stored in the register 150 having the second destination register address 142 recorded in the load history buffer 140, without accessing the memory 120.

For example, referring to (a), (b), and (c) of FIG. 5, the element ID 141 of the input data component required for the current operation may be '2', and the corresponding second destination register address 142 may be 'p2'. If the element ID '2' and the second destination register address 142 'p2' are recorded in the load history buffer 140, the processor 110 can reuse the data stored in the register 150 having the second destination register address 'p2' recorded in the load history buffer 140, without accessing the memory 120.

In contrast, if the element ID 141 of the input data matrix component required for the current operation and the corresponding second destination register address 142 are not recorded in the load history buffer 140, the processor 110 may access the memory 120 and fetch the data of that component from the memory hierarchy.

The processor 110 may then record the element ID 141 and the second destination register address of the register 150 into which the data was fetched in the load history buffer 140, and may update the register mapping table 121 so that the recorded second destination register address 142 corresponds to the first destination register address 501 of the input data matrix component required for the current operation.

For example, referring to (d) of FIG. 5, the element ID 141 of the input data component required for the current operation may be '2', and the first destination register address 501 of that component may be 'r4'. If the element ID 141 '2' and the second destination register address 142 'p2' are not recorded in the load history buffer 140, the processor 110 may access the memory 120 and fetch the data for the first destination register address 'r4' from the memory hierarchy (L1$).

The processor 110 may then record the element ID '2' and the second destination register address 142 'p2' of the register 150 in which the fetched data is stored in the load history buffer 140.
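The hit and miss paths of FIG. 5 can be sketched as a small Python model that counts memory accesses. It is a simplification under stated assumptions: the class `LoadHistoryModel` is hypothetical, memory is indexed directly by element ID rather than by address, physical registers are handed out in order, and buffer capacity and eviction are ignored.

```python
class LoadHistoryModel:
    """Toy model of the detection path: load history buffer plus registers."""

    def __init__(self, memory):
        self.memory = memory       # element ID -> value, stands in for memory 120
        self.load_history = {}     # element ID -> physical register address
        self.registers = {}        # physical register address -> value
        self.memory_accesses = 0   # number of fetches that reached memory

    def load(self, element_id):
        """Return (value, hit) for one load of the given element ID."""
        if element_id in self.load_history:
            # Hit: reuse the register recorded for this element ID.
            return self.registers[self.load_history[element_id]], True
        # Miss: fetch from memory, fill a new register, record it in the buffer.
        self.memory_accesses += 1
        physical = f"p{len(self.registers)}"
        self.registers[physical] = self.memory[element_id]
        self.load_history[element_id] = physical
        return self.registers[physical], False


model = LoadHistoryModel(memory={2: 1.5, 6: -0.5, 10: 3.0})
results = [model.load(eid) for eid in (2, 6, 2, 10, 6)]
```

In this run, five loads touch only three distinct element IDs, so the model reaches memory three times and serves the remaining two loads from registers.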

FIG. 6 is a diagram for explaining a convolution operation between an original input data matrix and a filter matrix according to an embodiment.

FIG. 6 shows an example of a convolution operation in which the original input data matrix 201 is a 4×4 matrix and the filter matrix 300 is a 3×3 matrix.

Here, a single convolution operation involves four matrix operations and produces a 2×2 output data matrix 400. In other words, one convolution operation requires several matrix multiplications. To reduce the number of operations, it may therefore be necessary to transform the original input data matrix 201 so that the output data matrix 400 can be produced by a single matrix multiplication.
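The sliding-window computation of FIG. 6 can be sketched in plain Python. The sketch assumes stride 1, no padding, and the cross-correlation form commonly used in neural networks (no filter flip); the function name `conv2d_valid` and the sample values are illustrative only.

```python
def conv2d_valid(inp, flt):
    """Direct sliding-window convolution (no filter flip), stride 1, no padding."""
    in_h, in_w = len(inp), len(inp[0])
    f_h, f_w = len(flt), len(flt[0])
    out = []
    for oy in range(in_h - f_h + 1):
        row = []
        for ox in range(in_w - f_w + 1):
            # One output component is one window-by-filter multiply-accumulate.
            acc = 0
            for ky in range(f_h):
                for kx in range(f_w):
                    acc += inp[oy + ky][ox + kx] * flt[ky][kx]
            row.append(acc)
        out.append(row)
    return out


inp = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 input, values 0..15
flt = [[1] * 3 for _ in range(3)]                         # 3x3 all-ones filter
out = conv2d_valid(inp, flt)
```

For a 4×4 input and a 3×3 filter the two outer loops run 2×2 = 4 times, which corresponds to the four separate matrix operations mentioned above.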

The output data matrix 400 may be the result of the convolution operation with the original input data matrix 201 as input. The feature data matrix 401 may be a matrix in which the components of the output data matrix 400 are listed in order so as to form a single column.

FIG. 7 is a diagram showing an input data matrix converted into a workspace according to an embodiment.

Referring to FIG. 7, to perform the convolution as a GEMM operation, the processor 110 may rearrange the original input data matrix 201, changing its size, number of components, and component order, into the memory area corresponding to the workspace, thereby generating the input data matrix 202 converted into the workspace.

In this case, the processor 110 may generate the input data matrix 202 by converting the original input data matrix 201 into a workspace whose number of rows equals the number of rows of the feature data matrix 401 and whose number of columns equals the size of the filter matrix 300, that is, its number of components.

For example, if the feature data matrix 401 has 4 rows and the size of the filter matrix 300 is 9, the input data matrix 202 converted into the workspace may have 4 rows and 9 columns.

Accordingly, the processor 110 may convert the original input data matrix 201 into the input data matrix 202 converted into the workspace so that the feature data matrix 401 can be output by a single matrix operation with the filter matrix 300.
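The workspace conversion and the single matrix product can be sketched in Python as follows. The row-major window ordering (commonly called im2col), stride 1, and no padding are assumptions chosen to match the 4-row, 9-column example; the helper names `to_workspace` and `matmul` are hypothetical.

```python
def to_workspace(inp, f_h, f_w):
    """Expand the input so each row holds one filter window (stride 1, no padding)."""
    out_h = len(inp) - f_h + 1
    out_w = len(inp[0]) - f_w + 1
    return [
        [inp[oy + ky][ox + kx] for ky in range(f_h) for kx in range(f_w)]
        for oy in range(out_h)
        for ox in range(out_w)
    ]


def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


inp = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 input, values 0..15
flt = [[1] * 3 for _ in range(3)]                         # 3x3 all-ones filter
workspace = to_workspace(inp, 3, 3)                       # 4 rows x 9 columns
filter_column = [[w] for row in flt for w in row]         # 9 rows x 1 column
feature = matmul(workspace, filter_column)                # 4 rows x 1 column
```

Here one call to `matmul` yields all four output components as a single column, at the cost of storing 36 workspace entries for an input that has only 16 distinct components.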

FIG. 8 is a diagram showing that the input data matrix according to an embodiment is a matrix whose components are recombined and arranged according to a rule.

Referring to FIG. 8, the input data matrix 202 converted into the workspace may be a matrix in which the components of the original input data matrix 201 are recombined and arranged according to a rule, so that the feature data matrix 401 can be output from the memory area by a single matrix operation with the filter matrix 300.

Specifically, in the input data matrix 202 converted into the workspace shown in FIG. 8, the component in row 1, column 2 and the component in row 2, column 1 are redundant with each other, and the component in row 1, column 3 and the component in row 2, column 2 are redundant with each other.

In addition, the components in row 1, column 8; row 2, column 7; row 3, column 5; and row 4, column 4 may be duplicates of one another.

Because the workspace-converted input data matrix 202 is a matrix in which duplicate components are arranged according to a specific rule, that rule may make it possible to generate the same element ID 141 for the duplicate components.

In other words, components that occupy different positions in the workspace-converted input data matrix 202 but originate from the same component of the original input data matrix 201 are arranged in the workspace-converted input data matrix 202 according to a specific rule. If the same element ID 141 is generated for such duplicate components and their data can be read from the same register 150, it may be possible to reduce the amount of computation and save energy.
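The duplication rule can be checked by mapping each workspace position back to the original coordinate it was copied from. The sketch below assumes a 3×3 filter, a 2×2 output, and a stride of 1; the helper `source_coordinate` and the constant names are hypothetical and used only for illustration.

```python
from collections import defaultdict

FILTER_COLS = 3   # assumed 3x3 filter
OUT_COLS = 2      # assumed 2x2 output
STRIDE = 1        # assumed stride


def source_coordinate(ws_row, ws_col):
    """Map a 0-based workspace position to its 0-based original coordinate."""
    oy, ox = divmod(ws_row, OUT_COLS)      # which output position this row serves
    ky, kx = divmod(ws_col, FILTER_COLS)   # which filter weight this column serves
    return (oy * STRIDE + ky, ox * STRIDE + kx)


# Group workspace positions (stored 1-based, to match the text) by origin.
groups = defaultdict(list)
for r in range(4):
    for c in range(9):
        groups[source_coordinate(r, c)].append((r + 1, c + 1))

print(groups[(0, 1)])
print(groups[(2, 1)])
```

Under these assumptions the 36 workspace positions collapse onto 16 original coordinates, and the groups reproduce the duplicate sets listed for FIG. 8.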

FIG. 9 is another diagram showing that the input data matrix according to an embodiment is a matrix whose components are arranged according to a rule.

A patch may be a submatrix of the workspace-converted input data matrix 202.

Referring to FIG. 9, the workspace-converted input data matrix 202 may consist of six patches, each a 2×3 matrix. Here, one patch may be a rearrangement of the components of one row of the original input data matrix 201.

For example, if the component data of row 1 of the original input data matrix 201 are '3', '1', '4', '-2' in order, the component data of the first patch may be '3', '1', '4', '1', '4', '-2'.

Likewise, if the component data of row 3 of the original input data matrix 201 are '4', '-2', '4', '0' in order, the component data of the third patch may be '4', '-2', '4', '-2', '4', '0'.

Meanwhile, the components of row 3 of the original input data matrix 201 need not appear only in the third patch. Referring to the drawing, the components of row 3 of the original input data matrix 201 also appear in the fifth patch.

Likewise, referring to the drawing, the components of the second patch and those of the fourth patch correspond to each other, and both are rearrangements of the components of row 2 of the original input data matrix 201.

As such, the workspace-converted input data matrix 202 may be a matrix in which a plurality of patches are arranged according to a rule so that the feature data matrix 401 can be output through a single matrix operation with the filter matrix 300.
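The patch structure can be reproduced with a short sketch. Rows 1 and 3 of `original` use the values quoted above; rows 2 and 4 are placeholder values, since the text does not state them. The 3×3 filter, 2×2 output, stride of 1, and the helper `patch` are likewise assumptions made for illustration.

```python
original = [
    [3, 1, 4, -2],     # row 1, as quoted in the text
    [10, 11, 12, 13],  # row 2: placeholder values (not given in the text)
    [4, -2, 4, 0],     # row 3, as quoted in the text
    [20, 21, 22, 23],  # row 4: placeholder values (not given in the text)
]
F, OUT = 3, 2  # assumed 3x3 filter and 2x2 output, stride 1

# im2col-style workspace: 4 rows (output positions) x 9 columns (filter weights).
workspace = [[original[oy + ky][ox + kx] for ky in range(F) for kx in range(F)]
             for oy in range(OUT) for ox in range(OUT)]


def patch(n):
    """Return the n-th (1-based) 2x3 patch, flattened row by row."""
    band, block = divmod(n - 1, 3)   # band of 2 workspace rows, block of 3 columns
    rows = workspace[band * OUT:(band + 1) * OUT]
    return [v for row in rows for v in row[block * F:(block + 1) * F]]


print(patch(1))
print(patch(3))
print(patch(3) == patch(5), patch(2) == patch(4))
```

Numbering the patches left to right and then top to bottom, the first and third patches carry the values given in the text, and the third and fifth patches (and the second and fourth) hold identical data.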

Accordingly, based on this rule, the processor 110 may update the register mapping table 121 so that the first destination register addresses of duplicate components, that is, components of the input data matrix 202 that represent the same data, correspond to the same second destination register address.

Specifically, the processor 110 may generate identifiers for the plurality of components of the workspace-converted input data matrix 202. The processor 110 may then update the register mapping table 121 so that the first destination register addresses of components for which the same identifier is generated correspond to the same second destination register address.

To this end, the ID generator 130 may generate patch IDs for the plurality of components of the input data matrix 202 based on the array indices of those components, the number of rows and columns of the filter matrix, and the number of columns of the output data matrix.

The ID generator 130 may then generate element IDs for the plurality of components based on the patch ID and the offset of each component of the input data matrix 202.

Here, the patch ID may be a value determined based on the number of columns and the number of channels of the input data matrix.

The array index may be a value indicating the position of each component when the plurality of components are arranged in a one-dimensional array.

Hereinafter, the method by which the ID generator 130 generates the element ID 141 for the plurality of components of the input data matrix 202 is described in detail.

FIG. 10 is a diagram for explaining the patch IDs of the plurality of components of the input data matrix 202 generated according to an embodiment. Here, the filter matrix 300 and the output data matrix 400 are assumed to be the same as in FIG. 6.

The ID generator 130 may generate the patch IDs 600 of the plurality of components of the input data matrix 202 based on the array indices of those components, the number of rows and columns of the filter matrix 300, and the number of columns of the output data matrix 400.

To this end, the ID generator 130 may assign array indices to the plurality of components of the input data matrix 202.

For example, as shown in FIG. 10a, array index values 0 to 8 may be assigned in order to the components from row 1, column 1 to row 1, column 9 of the input data matrix 202; values 9 to 17 to the components from row 2, column 1 to row 2, column 9; values 18 to 26 to the components from row 3, column 1 to row 3, column 9; and values 27 to 35 to the components from row 4, column 1 to row 4, column 9.

The ID generator 130 may then calculate a row element 1010 and a column element 1020 for each of the plurality of components of the input data matrix 202.

The row element 1010 of a component of the input data matrix 202 may be the quotient obtained by dividing the array index value of that component by the size of the filter matrix 300 (the number of columns of the filter matrix 300 × the number of rows of the filter matrix 300).

For example, as shown in FIG. 10a, the array index value of the component in row 2, column 3 of the input data matrix 202 may be 11, and that of the component in row 4, column 5 may be 31.

Since the size of the filter matrix 300 is 9, the row element of the component in row 2, column 3 may be 1, the quotient of 11 divided by 9, and the row element 1010 of the component in row 4, column 5 may be 3, the quotient of 31 divided by 9.

The column element 1020 of a component of the input data matrix 202 may be the remainder obtained by dividing the array index value of that component by the size of the filter matrix 300 (the number of columns of the filter matrix 300 × the number of rows of the filter matrix 300).

For example, the array index value of the component in row 2, column 3 of the input data matrix 202 may be 11, and that of the component in row 4, column 5 may be 31.

Since the size of the filter matrix 300 is 9, the column element of the component in row 2, column 3 may be 2, the remainder of 11 divided by 9, and the column element 1020 of the component in row 4, column 5 may be 4, the remainder of 31 divided by 9.
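The array index, row element, and column element amount to one multiplication and one integer division with remainder. A minimal sketch, assuming a 3×3 filter and a 9-column workspace; the function names are illustrative and do not appear in the disclosure.

```python
FILTER_ROWS, FILTER_COLS = 3, 3             # assumed 3x3 filter
FILTER_SIZE = FILTER_ROWS * FILTER_COLS     # also the workspace column count


def array_index(row, col):
    """Array index of a workspace component given its 1-based row and column."""
    return (row - 1) * FILTER_SIZE + (col - 1)


def row_col_elements(index):
    """Return (row element, column element): quotient and remainder by filter size."""
    return divmod(index, FILTER_SIZE)


for position in [(2, 3), (4, 5)]:
    idx = array_index(*position)
    print(position, idx, row_col_elements(idx))
```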

The ID generator 130 may then calculate a first patch element and a second patch element for each component of the input data matrix 202.

The first patch element may be the quotient obtained by dividing the row element of a component of the input data matrix 202 by the number of columns of the output data matrix 400.

For example, as shown in FIG. 10a, the row element of the component in row 2, column 3 of the input data matrix 202 may be 1, and that of the component in row 4, column 5 may be 3.

Since the number of columns of the output data matrix 400 is 2, the first patch element of the component in row 2, column 3 may be 0, the quotient of 1 divided by 2, and the first patch element of the component in row 4, column 5 may be 1, the quotient of 3 divided by 2.

The second patch element may be the quotient obtained by dividing the column element of a component of the input data matrix 202 by the number of columns of the filter matrix 300.

For example, in the input data matrix 202 of FIG. 10a, the column element of the component in row 2, column 3 may be 2, and that of the component in row 4, column 5 may be 4.

Since the number of columns of the filter matrix 300 is 3, the second patch element of the component in row 2, column 3 may be 0, the quotient of 2 divided by 3, and the second patch element of the component in row 4, column 5 may be 1, the quotient of 4 divided by 3.
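Both patch elements are plain integer quotients. The sketch below assumes an output data matrix with 2 columns and a filter matrix with 3 columns, as in the example; `patch_elements` is an illustrative name.

```python
OUT_COLS = 2      # assumed number of columns of the output data matrix
FILTER_COLS = 3   # assumed number of columns of the filter matrix


def patch_elements(row_element, col_element):
    """Return (first patch element, second patch element) for one component."""
    first = row_element // OUT_COLS       # quotient by the output column count
    second = col_element // FILTER_COLS   # quotient by the filter column count
    return first, second


print(patch_elements(1, 2))   # component in row 2, column 3
print(patch_elements(3, 4))   # component in row 4, column 5
```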

The ID generator 130 may then generate the patch ID 600 by multiplying the first patch element by the stride of the filter matrix 300 and adding the second patch element.

For example, assuming that the stride of the filter matrix 300 is 1, as shown in FIG. 10b, the patch ID 600 of the component in row 2, column 3 of the input data matrix 202 may be 0, that is, 0 multiplied by 1 plus 0. The patch ID 600 of the component in row 4, column 5 may be 2, that is, 1 multiplied by 1 plus 1. The method described above is only one example of obtaining the patch ID 600 for a component of the input data matrix 202; the patch ID 600 may be generated in an entirely different way, as long as it allows the same element ID 141 to be generated for duplicate components.
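Chaining the steps gives the patch ID directly from the array index. This is a sketch of the example method only, assuming a 3×3 filter, an output data matrix with 2 columns, and a stride of 1; the function name and default arguments are illustrative.

```python
def patch_id(index, filter_rows=3, filter_cols=3, out_cols=2, stride=1):
    """Patch ID of the workspace component with the given array index."""
    row_element, col_element = divmod(index, filter_rows * filter_cols)
    first = row_element // out_cols       # first patch element
    second = col_element // filter_cols   # second patch element
    return first * stride + second


# Patch IDs for every position of the 4x9 workspace under the assumptions above.
patch_ids = [[patch_id(r * 9 + c) for c in range(9)] for r in range(4)]
for row in patch_ids:
    print(row)
```

Under these assumptions the patch ID equals the 0-based row of the original input data matrix that the patch was taken from, which is why two different patches can share an ID.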

FIG. 11 is a diagram for explaining the element IDs of the plurality of components of the input data matrix 202 generated according to an embodiment. Here, the original input data matrix 201, the filter matrix 300, and the output data matrix 400 are assumed to be the same as in FIG. 6, and the patch ID 600 is assumed to be generated as in FIG. 10.

Based on the patch ID 600, the ID generator 130 may generate the same element ID 141 for components of the input data matrix 202 that represent duplicate data.

Specifically, the ID generator 130 may generate the same element ID 141 for components of the input data matrix 202 that represent duplicate data, based on the offset of each component of the input data matrix 202.

Here, the offset of a component of the input data matrix 202 may be the patch ID 600 of that component multiplied by the number of columns and the number of channels of the original input data matrix 201.

For example, assuming that the original input data matrix 201 has one channel, the patch ID 600 of the component in row 2, column 3 of the input data matrix 202 is 0, as shown in FIG. 10b, so the offset of that component may be 0, that is, 0 multiplied by 4 (the number of columns of the original input data matrix 201) and by 1 (the number of channels).

Likewise, the patch ID 600 of the component in row 4, column 5 of the input data matrix 202 is 2, so the offset of that component may be 8, that is, 2 multiplied by 4 (the number of columns of the original input data matrix 201) and by 1 (the number of channels).
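The offset is a single product. A minimal sketch, assuming an original input data matrix with 4 columns and 1 channel as in the example; the function name and defaults are illustrative.

```python
def component_offset(patch_id, in_cols=4, channels=1):
    """Offset of a workspace component: patch ID x input columns x channels."""
    return patch_id * in_cols * channels


print(component_offset(0))   # component in row 2, column 3 (patch ID 0)
print(component_offset(2))   # component in row 4, column 5 (patch ID 2)
```

Since the patch ID in this example tracks the source row of the original matrix, the offset can be read as the position of the first component of that row when the original matrix is laid out row by row.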

The ID generator 130 may then generate the element ID 141 of a component of the input data matrix 202 by adding three values: the offset of the component; the remainder obtained by dividing the row element 1010 of the component by the product of the number of columns of the output data matrix 400, the number of channels, and the stride; and the remainder obtained by dividing the column element 1020 of the component by the product of the number of columns of the filter matrix 300 and the number of channels.

For example, referring to FIG. 11 and assuming that the stride of the filter matrix 300 and the number of channels of the original input data matrix 201 are both 1, dividing the row element 1 of the component in row 2, column 3 by 2 (the number of columns of the output data matrix 400, 2, multiplied by the number of channels, 1, and the stride, 1) may leave a remainder of 1.

Dividing the column element 2 of the component in row 2, column 3 by 3 (the number of columns of the filter matrix 300, 3, multiplied by the number of channels, 1) may leave a remainder of 2.

Since the offset of the component in row 2, column 3 is 0, the ID generator 130 may generate '3', the sum of 0, 1, and 2, as the element ID 141 of that component.

Also referring to FIG. 11, dividing the row element 3 of the component in row 4, column 5 by 2 (the number of columns of the output data matrix 400, 2, multiplied by the number of channels, 1, and the stride, 1) may leave a remainder of 1.

Dividing the column element 4 of the component in row 4, column 5 by 3 (the number of columns of the filter matrix 300, 3, multiplied by the number of channels, 1) may leave a remainder of 1.

Since the offset of the component in row 4, column 5 is 8, the ID generator 130 may generate '10', the sum of 8, 1, and 1, as the element ID 141 of that component.
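The full element ID computation can be sketched end to end as follows. The code follows the formula as worded in the text and has only been checked for the example configuration (3×3 filter, 2×2 output, 4-column single-channel input, stride 1); the function name and default arguments are illustrative.

```python
def element_id(index, filter_rows=3, filter_cols=3, out_cols=2,
               in_cols=4, channels=1, stride=1):
    """Element ID of the workspace component with the given array index."""
    row_element, col_element = divmod(index, filter_rows * filter_cols)
    patch = (row_element // out_cols) * stride + col_element // filter_cols
    offset = patch * in_cols * channels
    row_part = row_element % (out_cols * channels * stride)
    col_part = col_element % (filter_cols * channels)
    return offset + row_part + col_part


# Element IDs for every position of the 4x9 workspace.
ids = [[element_id(r * 9 + c) for c in range(9)] for r in range(4)]
for row in ids:
    print(row)
```

For this configuration the element ID coincides with the position of the source component in the row-major original matrix, so the four duplicates listed for FIG. 8 (row 1, column 8; row 2, column 7; row 3, column 5; row 4, column 4) all receive the same ID.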

The processor 110 may then update the register mapping table 121 so that the first destination register addresses of those components of the input data matrix 202 for which the same element ID is generated correspond to the same second destination register address.

In this case, because the element IDs 141 generated by the ID generator 130 have the same value for duplicate components, the register mapping table 121 may consequently be updated so that the first destination register addresses of the duplicate components correspond to the same second destination register address.

The processor 110 may then perform the convolution operation by reusing, for the duplicate components, the register having the same second destination register address, based on the register mapping table 121 updated in this way.
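As a rough software model of this mapping, the register mapping table can be pictured as a dictionary from first destination register addresses to shared second destination register addresses. The register names `r0` to `r17`, the integer register numbering, and the function `build_register_mapping` are hypothetical; the element IDs listed are those the formula yields for the first two workspace rows under the example configuration.

```python
def build_register_mapping(loads):
    """Map each first destination register to a shared second destination register.

    `loads` is a sequence of (first destination register, element ID) pairs.
    """
    mapping = {}       # first destination register -> second destination register
    register_of = {}   # element ID -> second destination register holding its data
    for first_dest, elem_id in loads:
        if elem_id not in register_of:
            # First time this data is seen: allocate the next register number.
            register_of[elem_id] = len(register_of)
        mapping[first_dest] = register_of[elem_id]
    return mapping


# Element IDs of workspace rows 1 and 2 (3x3 filter, 2x2 output, stride 1).
element_ids = [0, 1, 2, 4, 5, 6, 8, 9, 10,
               1, 2, 3, 5, 6, 7, 9, 10, 11]
loads = [(f"r{i}", eid) for i, eid in enumerate(element_ids)]
mapping = build_register_mapping(loads)
print(len(mapping), len(set(mapping.values())))
```

In this model the 18 loads of the first two workspace rows resolve to 12 distinct registers, because six of the components repeat data already loaded.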

Meanwhile, the embodiment above describes how the processor 110 updates the register mapping table 121 when a single-batch convolution is performed. According to an embodiment of the present disclosure, however, even when a convolution with multiple batches is performed, the processor 110 may update the register mapping table 121 so that the first destination register addresses of duplicate components correspond to the same second destination register address.

To this end, the ID generator 130 may generate element IDs and batch IDs for the plurality of components of the input data matrix 202.

The ID generator 130 may then update the register mapping table 121 so that the first destination register addresses of components for which the same element ID and batch ID are generated correspond to the same second destination register address.

FIG. 12 is a diagram showing the configuration of an ID generator and a load record buffer according to another embodiment.

Referring to FIG. 12, the ID generator 130 may generate the element ID 1210 and the batch ID 1220 of the component of the input data matrix required for the current matrix operation. The second destination register address 142 corresponding to the generated element ID 1210 and batch ID 1220 may be recorded in the load record buffer 140.

The processor 110 may then identify whether the element ID 1210 and the batch ID 1220 of the component of the input data matrix required for the current operation, and the corresponding second destination register address 142, are recorded in the load record buffer 140.

When the element ID 1210 and the batch ID 1220 of the component of the input data matrix required for the current operation and the corresponding second destination register address 142 are recorded in the load record buffer 140, the processor 110 may update the register mapping table 121 so that the second destination register address 142 corresponds to the first destination register address 501 of that component.

Accordingly, to load the data of the component of the input data matrix required for the current matrix operation, the processor 110 may reuse the data stored in the register 150 having the second destination register address 142 recorded in the load record buffer 140, without accessing the memory 120.

On the other hand, when the element ID 1210 and the batch ID 1220 of the component of the input data required for the current operation and the corresponding second destination register address 142 are not recorded in the load record buffer 140, the processor 110 may access the memory 120 and fetch the data of the component for the first destination register address 501 from the memory hierarchy.

The processor 110 may then record in the load record buffer 140 the element ID 1210 and the batch ID 1220 together with the second destination register address 142 of the register 150 into which the data was fetched, and update the register mapping table 121 so that the recorded second destination register address 142 corresponds to the first destination register address 501 of the component.
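The hit and miss paths described above can be modelled in software as follows. This is a behavioural sketch only: the dictionaries standing in for the load record buffer and the register mapping table, the list standing in for the register file, the "next free slot" allocation, and the `fetch` callback are all assumptions made for illustration, not the hardware design of the disclosure.

```python
def handle_load(first_dest, element_id, batch_id, buffer, mapping_table,
                register_file, fetch_from_memory):
    """Process one load; return True on a load record buffer hit."""
    key = (element_id, batch_id)
    if key in buffer:
        # Hit: reuse the register that already holds this data.
        second_dest = buffer[key]
        hit = True
    else:
        # Miss: fetch from memory into a new register and record it.
        second_dest = len(register_file)   # hypothetical allocation policy
        register_file.append(fetch_from_memory(element_id, batch_id))
        buffer[key] = second_dest
        hit = False
    mapping_table[first_dest] = second_dest
    return hit


buffer, table, registers, fetches = {}, {}, [], []


def fetch(element_id, batch_id):
    """Stand-in for a memory access; logs the request and returns dummy data."""
    fetches.append((element_id, batch_id))
    return f"data[{batch_id}][{element_id}]"


requests = [("r0", 9, 0), ("r1", 9, 0), ("r2", 9, 1)]
results = [handle_load(dest, eid, bid, buffer, table, registers, fetch)
           for dest, eid, bid in requests]
print(results, table, fetches)
```

The second request hits because it carries the same element ID and batch ID as the first, while the third misses because its batch ID differs even though the element ID is the same.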

To this end, the ID generator 130 may generate the element IDs 1210 and the batch IDs 1220 of the plurality of components of the input data matrix 202.

Here, the way in which the ID generator 130 assigns array indices to the plurality of components of the input data matrix 202, calculates the row elements and column elements, and generates the element ID 1210 overlaps with the description given for FIGS. 6 to 10 and is therefore omitted.

The ID generator 130 may generate the batch IDs 1220 of the plurality of components of the input data matrix 202 based on the array indices of those components, the number of rows and columns of the filter matrix 300, and the number of rows and columns of the output data matrix 400.

Specifically, the ID generator 130 may generate, as the batch ID 1220, the quotient obtained by dividing the row element of a component of the input data matrix 202 by the size of the output data matrix 400 (the number of columns of the output data matrix 400 multiplied by the number of rows of the output data matrix 400).
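The batch ID is one more integer quotient on top of the row element. The sketch below assumes a 3×3 filter and a 2×2 output, and further assumes that the workspaces of successive batches are stacked row-wise so that each batch occupies four consecutive workspace rows; that layout is inferred from the formula and is not stated in the text. The function name is illustrative.

```python
def batch_id(index, filter_rows=3, filter_cols=3, out_rows=2, out_cols=2):
    """Batch ID of the workspace component with the given array index."""
    row_element = index // (filter_rows * filter_cols)
    return row_element // (out_rows * out_cols)


# Under the stacking assumption, workspace rows 1-4 belong to batch 0
# and rows 5-8 to batch 1.
print([batch_id(row * 9) for row in range(8)])
```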

The processor 110 may then update the register mapping table 121 so that the first destination register addresses 501 of those components of the input data matrix 202 for which the same element ID is generated correspond to the same second destination register address 142.

In this case, because the element IDs 1210 and the batch IDs 1220 generated by the ID generator 130 have the same values for duplicate components, the register mapping table 121 may consequently be updated so that the first destination register addresses 501 of the duplicate components correspond to the same second destination register address 142.

The processor 110 may then perform the convolution operation by reusing, for the duplicate components, the register 150 having the same second destination register address 142, based on the register mapping table 121 updated in this way.

At least one component may be added or removed depending on the performance of the components described above. It will also be readily understood by those of ordinary skill in the art that the relative positions of the components may be changed depending on the performance or structure of the system.

FIG. 13 is a flowchart of a convolution operation method according to an embodiment. This is merely a preferred embodiment for achieving the object of the present invention, and some steps may of course be added or removed as needed.

Referring to FIG. 13, the processor 110 may convert the original input data into the memory area corresponding to the workspace (1310).

Here, the at least one processor 110 may generate the input data matrix by changing the size, the number of components, and the order of components of the original input data matrix to fit the memory area corresponding to the workspace.

Specifically, the processor 110 may generate the input data matrix by converting the original input data matrix into a workspace that has as many rows as the feature data matrix and as many columns as the size of the filter matrix.

The processor 110 may then receive tensor core load data that includes the first destination register address of the component of the input data matrix required for the current operation (1320).

The processor 110 may then generate an identifier for the component of the input data matrix required for the current operation (1330). The identifier may include an element ID. When a convolution with multiple batches is performed, the identifier may further include a batch ID.

To this end, the processor 110 may generate patch IDs for the plurality of components of the input data matrix based on the array index of the component required for the current operation, the number of rows and the number of columns of the filter matrix, and the number of columns of the output data matrix.

The processor 110 may then generate the element ID based on the patch ID and the offset of the component of the input data matrix required for the current operation.
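As an illustration only, the following Python sketch shows one reading of how a patch ID and an element ID could be derived from an array index, using the quotient and remainder definitions recited in the claims. It assumes a single channel, no padding, and a workspace laid out row by row; the function names `patch_id` and `element_id` are hypothetical and do not appear in the disclosure.

```python
def patch_id(array_index, filter_rows, filter_cols, out_cols, stride=1):
    """Patch ID of a workspace component (single channel assumed)."""
    filter_size = filter_rows * filter_cols
    row_elem = array_index // filter_size     # workspace row of the component
    col_elem = array_index % filter_size      # workspace column of the component
    first_patch = row_elem // out_cols        # output row that this workspace row serves
    second_patch = col_elem // filter_cols    # filter row that this column belongs to
    return first_patch * stride + second_patch


def element_id(array_index, filter_rows, filter_cols, in_cols, out_cols, stride=1):
    """Element ID of a workspace component (single channel assumed)."""
    filter_size = filter_rows * filter_cols
    row_elem = array_index // filter_size
    col_elem = array_index % filter_size
    # Offset: patch ID times the number of input columns (times 1 channel).
    offset = patch_id(array_index, filter_rows, filter_cols, out_cols, stride) * in_cols
    return offset + (row_elem % out_cols) * stride + (col_elem % filter_cols)


# 3x3 input, 2x2 filter, stride 1: 2x2 output and a 16-component workspace.
ids = [element_id(i, 2, 2, in_cols=3, out_cols=2) for i in range(16)]
```

Under these assumptions the element ID coincides with the position of the value in the flattened original input, so workspace components that carry the same input value (for example, array indices 1 and 4 in this example) receive the same element ID.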

In addition, the processor 110 may generate the batch ID based on the array index of the component of the input data matrix required for the current operation, the number of rows and the number of columns of the filter matrix, and the number of rows and the number of columns of the output data matrix.
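The disclosure lists the inputs of the batch ID but this passage does not spell out a formula. As a sketch under the assumption that the workspaces of consecutive batches are stored back to back, the batch ID could be the array index divided by the number of workspace components per batch. The function name `batch_id` is hypothetical.

```python
def batch_id(array_index, filter_rows, filter_cols, out_rows, out_cols):
    """Batch ID of a workspace component.

    Assumes (not stated in the source) that each batch occupies one
    contiguous workspace of out_rows * out_cols rows and
    filter_rows * filter_cols columns.
    """
    components_per_batch = filter_rows * filter_cols * out_rows * out_cols
    return array_index // components_per_batch


# 2x2 filter and 2x2 output: 16 workspace components per batch.
first_batch = batch_id(3, 2, 2, 2, 2)
second_batch = batch_id(19, 2, 2, 2, 2)
```

Array indices 3 and 19 occupy the same position within their respective workspaces, yet they receive different batch IDs, so an identifier that pairs the batch ID with the element ID keeps data from different batches apart.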

The processor 110 may then identify whether the generated identifier and a second destination register address corresponding to it are recorded in the load record buffer (1340).

If the processor 110 identifies that the generated identifier and the corresponding second destination register address are recorded in the load record buffer (S1340-Y), it may update the register mapping table 121 so that the recorded second destination register address corresponds to the first destination register address of the component of the input data matrix required for the current operation (1350).

Conversely, if the processor 110 identifies that the generated identifier and the corresponding second destination register address are not recorded in the load record buffer (S1340-N), it may access the memory 120 and fetch the data of the component of the input data matrix required for the current operation from the memory hierarchy (1360).

The processor 110 may then record, in the load record buffer, the generated identifier together with the second destination register address of the register into which the data was fetched, and update the register mapping table so that the recorded second destination register address corresponds to the first destination register address of the component of the input data matrix required for the current operation (1370).

The processor 110 may then translate the first destination register address into the corresponding second destination register address based on the register mapping table 121 (1380).
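As an illustration only, steps 1340 to 1380 can be modeled in software as follows. The sketch represents the load record buffer, the register mapping table, and the register file as Python dictionaries and hands out second destination registers sequentially; these modeling choices and the name `process_loads` are assumptions, not part of the disclosure.

```python
def process_loads(loads, memory):
    """Model of the hit/miss handling for tensor core loads.

    loads:  list of (first_destination_register, identifier) pairs
    memory: dict mapping an identifier to the data it names
    Returns (mapping_table, registers, memory_accesses).
    """
    load_record_buffer = {}   # identifier -> second destination register
    mapping_table = {}        # first destination register -> second destination register
    registers = {}            # second destination register -> data
    memory_accesses = 0
    next_free_register = 0    # hypothetical sequential allocation

    for first_reg, ident in loads:
        if ident in load_record_buffer:
            # Hit (1350): reuse the recorded register, no memory access.
            second_reg = load_record_buffer[ident]
        else:
            # Miss (1360, 1370): fetch the data and record where it was placed.
            second_reg = next_free_register
            next_free_register += 1
            registers[second_reg] = memory[ident]
            memory_accesses += 1
            load_record_buffer[ident] = second_reg
        mapping_table[first_reg] = second_reg
    return mapping_table, registers, memory_accesses


memory = {"a": 10, "b": 20, "c": 30}
loads = [(0, "a"), (1, "b"), (2, "a"), (3, "c"), (4, "b")]
mapping_table, registers, memory_accesses = process_loads(loads, memory)

# Translation (1380): an operand named by first destination register 2
# is read from the register that already holds the data for identifier "a".
operand = registers[mapping_table[2]]
```

In this example five loads are issued but only three reach memory, because first destination registers 2 and 4 are mapped to registers that were already filled by earlier loads with the same identifier.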

Accordingly, when the processor 110 performs an operation that uses the data of the component assigned to a particular first destination register among the plurality of components of the input data matrix, it may obtain that data from the register 150 whose address is the second destination register address into which that first destination register address was translated, and then perform the operation.

To verify the performance of the convolution operation method according to an embodiment of the present invention, an evaluation was conducted using GPGPU-sim together with a tensor core model. GPGPU-sim was configured to model an NVIDIA Titan V.

The evaluation used three representative deep neural networks (DNNs). Specifically, it ran the computations of ResNet, GAN, and YOLO, implemented on the basis of the cudaTensorCoreGemm kernel of NVIDIA CUDA SDK 9.1.

FIG. 14 is a graph showing the effect of the convolution operation method of the present invention.

Referring to (a) of FIG. 14, in the Oracle state the convolution operation method according to the embodiment of the present invention achieves a 26% speedup over the conventional method.

The Oracle state assumes that the size of the load record buffer 140 is infinite. In practice, however, the evaluation was run with the load record buffer 140 sized at 1024 entries. On average, the 1024-entry load record buffer 140 delivered a performance improvement of 22.1%, about 4/5 of that obtained in the Oracle state.
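As an illustration only, the difference between an unbounded (Oracle) buffer and a bounded one can be sketched by counting how many loads find their identifier already recorded. The disclosure does not state a replacement policy in this passage, so the first-in-first-out eviction below, like the name `count_hits`, is purely an assumption.

```python
from collections import OrderedDict


def count_hits(identifiers, capacity=None):
    """Count loads whose identifier is already in the load record buffer.

    capacity=None models the Oracle state (unbounded buffer).
    A bounded buffer evicts its oldest entry first (assumed FIFO policy).
    """
    buffer = OrderedDict()
    hits = 0
    for ident in identifiers:
        if ident in buffer:
            hits += 1
            continue
        if capacity is not None and len(buffer) >= capacity:
            buffer.popitem(last=False)   # evict the oldest recorded identifier
        buffer[ident] = True
    return hits


stream = [0, 1, 2, 0, 1, 2]
oracle_hits = count_hits(stream)              # unbounded buffer
small_hits = count_hits(stream, capacity=2)   # too small to retain reused entries
```

With this toy stream the unbounded buffer catches all three repeated identifiers, whereas a two-entry buffer evicts each identifier just before it is needed again and catches none. A buffer that is large relative to the reuse distance behaves like the Oracle case.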

In addition, referring to (b) of FIG. 14, in the Oracle state the convolution operation method according to the embodiment of the present invention reduces redundant tensor core loads by up to 76% compared with the conventional method.
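To give a feel for why so many loads can be redundant, the following Python sketch computes, for a single convolution layer, the fraction of workspace components that repeat a value already present elsewhere in the workspace. It assumes a single channel and no padding, the layer shapes are arbitrary examples rather than the evaluated networks, and the name `redundant_fraction` is hypothetical. The result is a geometric upper bound, not a reproduction of the measured figures.

```python
def redundant_fraction(in_rows, in_cols, filter_rows, filter_cols, stride=1):
    """Fraction of workspace components that duplicate another component."""
    out_rows = (in_rows - filter_rows) // stride + 1
    out_cols = (in_cols - filter_cols) // stride + 1
    total = out_rows * out_cols * filter_rows * filter_cols
    # Distinct input positions that the workspace actually references.
    unique = {(oy * stride + ky, ox * stride + kx)
              for oy in range(out_rows)
              for ox in range(out_cols)
              for ky in range(filter_rows)
              for kx in range(filter_cols)}
    return 1 - len(unique) / total


overlapping = redundant_fraction(8, 8, 3, 3)               # stride 1: windows overlap
non_overlapping = redundant_fraction(9, 9, 3, 3, stride=3)  # stride 3: no overlap
```

For an 8x8 input and a 3x3 filter with stride 1, the workspace holds 324 components but references only 64 distinct inputs, so roughly four fifths of the components are duplicates. When the stride equals the filter width, the windows do not overlap and the fraction drops to zero.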

In this case, the register address was changed for about 3/4 of all memory loads.

As described, the evaluation of the convolution operation method of the present invention confirmed that aggressively eliminating tensor core loads achieves a 26% increase in operation speed and a 34% energy saving.

The disclosed embodiments have been described above with reference to the accompanying drawings. A person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be practiced in forms different from the disclosed embodiments without changing its technical idea or essential features. The disclosed embodiments are illustrative and should not be construed as limiting.

100: Convolution operation device
110: Processor
120: Memory
121: Register mapping table
130: ID generator
140: Load record buffer
141: Element ID
142: Second destination register address
150: Register
201: Original input data matrix
202: Input data matrix converted to workspace
300: Filter matrix
400: Output data matrix
401: Feature data matrix
500: Tensor core load data
501: First destination register address
600: Patch ID
700: Array index

Claims

A convolution operation method of generating a feature data matrix corresponding to an output data matrix by performing a general matrix multiplication (GEMM) operation on an input data matrix and a set filter matrix, the method comprising:
updating, by at least one processor, a register mapping table so that first destination register addresses of redundant components, which represent mutually overlapping data among a plurality of components of the input data matrix, correspond to the same second destination register address; and
performing, by the at least one processor, a convolution operation by reusing, for the redundant components, a register having the same second destination register address based on the register mapping table.

The convolution operation method of claim 1,
wherein the updating of the register mapping table comprises:
generating identifiers for the plurality of components; and
updating the register mapping table so that first destination register addresses of components for which the same identifier is generated, among the plurality of components, correspond to the same second destination register address.

The convolution operation method of claim 2,
wherein the identifier includes an element ID,
wherein the generating of the identifiers comprises:
generating patch IDs of the plurality of components based on array indices of the plurality of components, the number of rows and the number of columns of the filter matrix, and the number of columns of the output data matrix; and
generating element IDs of the plurality of components based on the patch IDs and offsets of the plurality of components,
wherein the offset is a value determined based on the patch ID, the number of columns of the input data matrix, and the number of channels, and
wherein the array index is a value indicating the position of each component when the plurality of components are arranged in a one-dimensional array.

The convolution operation method of claim 3,
wherein the identifier further includes a batch ID, and
wherein the generating of the identifiers further comprises generating batch IDs of the plurality of components based on the array indices of the plurality of components, the number of rows and the number of columns of the filter matrix, and the number of rows and the number of columns of the output data matrix.

The convolution operation method of claim 3,
wherein the generating of the patch IDs comprises:
calculating a first patch element and a second patch element of the plurality of components; and
generating the patch ID by adding the second patch element to a value obtained by multiplying the first patch element by the stride of the filter matrix,
wherein the first patch element is the quotient obtained when the row element of the plurality of components is divided by the number of columns of the output data matrix,
wherein the second patch element is the quotient obtained when the column element of the plurality of components is divided by the number of columns of the filter matrix,
wherein the row element is the quotient obtained when the array index is divided by the size of the filter matrix, and
wherein the column element is the remainder obtained when the array index is divided by the size of the filter matrix.

The convolution operation method of claim 5,
wherein the generating of the element IDs comprises generating the element ID by adding, to the offset, a value obtained by multiplying the remainder obtained when the row element is divided by the number of columns of the output data matrix by the number of channels and the stride, and the remainder obtained when the column element is divided by the product of the number of columns of the filter matrix and the number of channels, and
wherein the offset is a value obtained by multiplying the patch ID by the number of columns of the input data matrix and the number of channels.

The convolution operation method of claim 1,
further comprising generating, by the at least one processor, the input data matrix by rearranging an original input data matrix into a memory area corresponding to a workspace, thereby changing the size of the original input data matrix and the number and order of its plurality of components,
wherein the input data matrix is a matrix in which the plurality of components of the original input data matrix are recombined and arranged according to a rule so that the feature data matrix can be output through a GEMM operation with the filter matrix.

The convolution operation method of claim 7,
wherein the generating of the input data matrix comprises generating the input data matrix by converting the original input data matrix into a workspace having the same number of rows as the number of rows of the feature data matrix and the same number of columns as the size of the filter matrix.

The convolution operation method of claim 1,
wherein the updating of the register mapping table comprises:
receiving tensor core load data;
identifying whether an identifier of a component having a first destination register address included in the tensor core load data, and a second destination register address, are recorded in a load record buffer;
if the identifier and the second destination register address are not recorded in the load record buffer, accessing memory to fetch data of the component from the memory hierarchy, recording in the load record buffer the identifier and the second destination register address of the register in which the fetched data is stored, and updating the register mapping table so that the second destination register address recorded in the load record buffer corresponds to the first destination register address; and
if the identifier and the second destination register address are recorded in the load record buffer, updating the register mapping table, without accessing memory, so that the second destination register address recorded in the load record buffer corresponds to the first destination register address.

A computer program stored in a computer-readable recording medium to execute the convolution operation method of any one of claims 1 to 9.

A convolution operation device comprising:
a memory in which a register mapping table is stored; and
a processor configured to perform a convolution operation that generates a feature data matrix corresponding to an output data matrix by performing a general matrix multiplication (GEMM) operation on an input data matrix and a set filter matrix,
wherein the processor is configured to:
update the register mapping table so that first destination register addresses of redundant components, which represent mutually overlapping data among a plurality of components of the input data matrix, correspond to the same second destination register address; and
perform the convolution operation by reusing, for the redundant components, a register having the same second destination register address based on the register mapping table.

The convolution operation device of claim 11,
wherein the processor is configured to:
generate identifiers for the plurality of components; and
update the register mapping table so that first destination register addresses of components for which the same identifier is generated, among the plurality of components, correspond to the same second destination register address.

The convolution operation device of claim 12,
wherein the identifier includes an element ID,
wherein the processor is configured to:
generate patch IDs of the plurality of components based on array indices of the plurality of components, the number of rows and the number of columns of the filter matrix, and the number of columns of the output data matrix; and
generate element IDs of the plurality of components based on the patch IDs and offsets of the plurality of components,
wherein the offset is a value determined based on the patch ID, the number of columns of the input data matrix, and the number of channels, and
wherein the array index is a value indicating the position of each component when the plurality of components are arranged in a one-dimensional array.

The convolution operation device of claim 13,
wherein the identifier further includes a batch ID, and
wherein the processor is configured to generate batch IDs of the plurality of components based on the array indices of the plurality of components, the number of rows and the number of columns of the filter matrix, and the number of rows and the number of columns of the output data matrix.

The convolution operation device of claim 13,
wherein the processor is configured to:
calculate a first patch element and a second patch element of the plurality of components; and
generate the patch ID by adding the second patch element to a value obtained by multiplying the first patch element by the stride of the filter matrix,
wherein the first patch element is the quotient obtained when the row element of the plurality of components is divided by the number of columns of the output data matrix,
wherein the second patch element is the quotient obtained when the column element of the plurality of components is divided by the number of columns of the filter matrix,
wherein the row element is the quotient obtained when the array index is divided by the size of the filter matrix, and
wherein the column element is the remainder obtained when the array index is divided by the size of the filter matrix.

The convolution operation device of claim 15,
wherein the processor is configured to generate the element ID by adding, to the offset, a value obtained by multiplying the remainder obtained when the row element is divided by the number of columns of the output data matrix by the number of channels and the stride, and the remainder obtained when the column element is divided by the product of the number of columns of the filter matrix and the number of channels, and
wherein the offset is a value obtained by multiplying the patch ID by the number of columns of the input data matrix and the number of channels.

The convolution operation device of claim 11,
wherein the processor is configured to generate the input data matrix by rearranging an original input data matrix into a memory area corresponding to a workspace, thereby changing the size of the original input data matrix and the number and order of its plurality of components, and
wherein the input data matrix is a matrix in which the plurality of components of the original input data matrix are recombined and arranged according to a rule so that the feature data matrix can be output through a GEMM operation with the filter matrix.

The convolution operation device of claim 17,
wherein the processor is configured to generate the input data matrix by converting the original input data matrix into a workspace having the same number of rows as the number of rows of the feature data matrix and the same number of columns as the size of the filter matrix.

The convolution operation device of claim 11,
wherein the processor is configured to:
receive tensor core load data;
identify whether an identifier of a component having a first destination register address included in the tensor core load data, and a second destination register address, are recorded in a load record buffer;
if the identifier and the second destination register address are not recorded in the load record buffer, fetch data of the component, record in the load record buffer the identifier and the second destination register address of the register in which the fetched data is stored, and update the register mapping table so that the second destination register address recorded in the load record buffer corresponds to the first destination register address; and
if the identifier and the second destination register address are recorded in the load record buffer, update the register mapping table so that the second destination register address recorded in the load record buffer corresponds to the first destination register address.