KR20210014897A

KR20210014897A - Matrix operator and matrix operation method for artificial neural network

Info

Publication number: KR20210014897A
Application number: KR1020190092932A
Authority: KR
Inventors: 정기석; 박상수
Original assignee: 한양대학교 산학협력단
Priority date: 2019-07-31
Filing date: 2019-07-31
Publication date: 2021-02-10
Also published as: WO2021020848A3; KR102372869B1; WO2021020848A2

Abstract

The present invention provides a matrix operator and a matrix operation method for an artificial neural network. The matrix operator comprises: a first buffer that receives and stores a first matrix, which is a multiplicand matrix; a second buffer that receives and stores a second matrix, which is a multiplier matrix by which the first matrix is multiplied; and an operation unit. The operation unit receives a plurality of elements sequentially selected in column units from the first matrix, receives a plurality of elements sequentially selected in row units from the second matrix in correspondence with the column selected from the first matrix, multiplies each element of the selected column of the first matrix by all elements of the selected row of the second matrix, and cumulatively adds the multiplication results for the sequentially selected columns of the first matrix and rows of the second matrix, thereby obtaining a result matrix that is the matrix product of the first matrix and the second matrix. The invention can thereby increase operation efficiency and operation speed and reduce power consumption.

Description

Matrix operator and matrix operation method for artificial neural networks {MATRIX OPERATOR AND MATRIX OPERATION METHOD FOR ARTIFICIAL NEURAL NETWORK}

The present invention relates to an artificial neural network module and a scheduling method thereof, and more particularly to an artificial neural network module for high-efficiency computation and a scheduling method thereof.

Recently, artificial neural networks, which mimic the way the human brain recognizes patterns and are configured to process information in a brain-like manner, have been applied in a variety of fields.

Such artificial neural networks require training on vast amounts of data, and a large number of addition and multiplication operations must be performed in the process. Accordingly, a chip that performs operations for an artificial neural network must be provided with a large number of operation circuits such as multiply-accumulate (MAC) operators.

Therefore, a new class of hardware accelerators specialized for deep learning with artificial neural networks has recently attracted great attention. Deep learning accelerators have been proposed in different forms depending on the usage environment and purpose. For example, GPUs (Graphics Processing Units) are mainly used in servers and workstations, where performance matters most, whereas edge devices such as smartphones, which prioritize low power, mainly use dedicated hardware designed with an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit), that is, an NPU (Neural Processing Unit).

However, because they are dedicated hardware, many of the accelerators released to date lack the flexibility to handle the various types of layers or tensors used in different artificial neural networks. This is a problem because it makes it difficult to support the wide variety of deep learning applications and models currently in use.

Meanwhile, using a large number of operation devices in a variable manner complicates the control circuit. As a result, some operation devices may go unused and remain idle while the artificial neural network performs its operations, which causes inefficiency and may consume additional, unnecessary power.

Korean Patent Laid-Open Publication No. 10-2019-0055447 (published May 23, 2019)

An object of the present invention is to provide a matrix operator and a matrix operation method that can increase operation efficiency and reduce power consumption by performing multiplication and addition operations simultaneously in parallel according to a pipeline technique.

To achieve the above object, a matrix operator for an artificial neural network according to an embodiment of the present invention includes: a first buffer that receives and stores a first matrix, which is a multiplicand matrix; a second buffer that receives and stores a second matrix, which is a multiplier matrix by which the first matrix is multiplied; and an operation unit that receives a plurality of elements sequentially selected in column units from the first matrix, receives a plurality of elements sequentially selected in row units from the second matrix in correspondence with the column selected from the first matrix, multiplies each element of the selected column of the first matrix by all elements of the selected row of the second matrix, and cumulatively adds the multiplication results for the sequentially selected columns of the first matrix and rows of the second matrix to obtain a result matrix that is the matrix product of the first matrix and the second matrix.

The operation unit may include a plurality of operation processing lanes, each of which receives a corresponding one of the elements of the column selected from the first matrix and all elements of the row selected from the second matrix, multiplies them, and adds each element-wise multiplication result to the accumulated value of the previous multiplication results to obtain the elements of one row of a partial accumulation matrix.

While adding the element-wise multiplication result to the accumulated value of the previous multiplication results, each of the plurality of operation processing lanes may receive the element of the column of the first matrix and all elements of the row of the second matrix that are selected next according to a predetermined order, and perform the multiplication operation on them.
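The overlap of multiplication and accumulation described above can be illustrated with a toy two-stage pipeline model. This is only a sketch under stated assumptions: one cycle per multiplication and one cycle per addition, a single pipeline register between the two stages, and the hypothetical name `pipelined_mac`. None of these details is taken from the specification.

```python
def pipelined_mac(a_values, b_values):
    """Toy two-stage MAC pipeline (assumed timing: 1 cycle per stage).

    In each cycle the adder consumes the product made in the previous
    cycle while the multiplier already works on the next pair.
    Returns (accumulated value, number of cycles).
    """
    acc = 0          # accumulated value
    pending = None   # product from the previous cycle, waiting to be added
    cycles = 0
    pairs = list(zip(a_values, b_values))
    for pair in pairs + [None]:          # one extra cycle drains the last product
        if pending is not None:
            acc += pending               # ADD stage
        pending = pair[0] * pair[1] if pair is not None else None  # MUL stage
        cycles += 1
    return acc, cycles


total, cycles = pipelined_mac([1, 2, 3], [4, 5, 6])
```

Under these assumptions, k multiply-accumulate steps finish in k + 1 cycles, whereas performing each multiplication and addition strictly one after the other would take 2k cycles.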

Each of the plurality of operation processing lanes includes a plurality of processing elements, and each of the plurality of processing elements may include: a multiplier that receives a corresponding one element of the selected column of the first matrix and a corresponding one of the elements of the selected row of the second matrix and multiplies them; an adder that updates an accumulated value by adding the multiplication result output from the multiplier to the accumulated value obtained by cumulatively adding the multiplication results of previously applied elements; and an accumulation register that stores the accumulated value updated by the adder.

When the i-th column (where i is a natural number) of the first matrix is selected in the first buffer, the second buffer may select the i-th row of the second matrix.

The matrix operator may be implemented as an artificial neural network module for performing the operation designated for at least one of a plurality of layers of the artificial neural network, the first matrix may be a feature map applied to the at least one layer, and the second matrix may be a kernel predetermined for the at least one layer.

To achieve the above object, a matrix operation method for an artificial neural network according to another embodiment of the present invention includes: receiving and storing a first matrix, which is a multiplicand matrix, and a second matrix, which is a multiplier matrix by which the first matrix is multiplied; receiving a plurality of elements sequentially selected in column units from the first matrix and a plurality of elements sequentially selected in row units from the second matrix in correspondence with the column selected from the first matrix; multiplying each element of the selected column of the first matrix by all elements of the selected row of the second matrix; and cumulatively adding the multiplication results for the sequentially selected columns of the first matrix and rows of the second matrix to obtain a result matrix that is the matrix product of the first matrix and the second matrix.
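The column-by-row ordering of this method corresponds to a standard identity that writes a matrix product as a sum of outer products. The labels below follow those the description uses for FIG. 2 (A is the m × k first matrix, B is the k × n second matrix, C is the result); the colon notation for a whole column or row is an assumption added here for illustration.

```latex
C = AB = \sum_{i=1}^{k} a_{:,i}\, b_{i,:}
\qquad\Longleftrightarrow\qquad
c_{r,c} = \sum_{i=1}^{k} a_{r,i}\, b_{i,c},
\quad 1 \le r \le m,\; 1 \le c \le n
```

Each term of the sum is the m × n partial matrix obtained by multiplying every element of the i-th column of A by every element of the i-th row of B, so accumulating the k partial matrices yields the same result matrix as the usual row-times-column rule.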

Accordingly, the matrix operator and the matrix operation method for an artificial neural network according to embodiments of the present invention allow the multiplication and addition operations of the matrix multiplications performed in the layers of the artificial neural network to be carried out simultaneously in parallel, thereby increasing operation efficiency and reducing power consumption. In particular, operation efficiency can be maximized by accumulating additions while multiplications are being performed, according to the pipeline technique.

FIG. 1 shows a schematic structure of an example of an artificial neural network.
FIG. 2 is a diagram showing a general matrix multiplication algorithm.
FIG. 3 shows the number of multiplication and addition operations required by the matrix multiplication algorithm of FIG. 2.
FIG. 4 shows a schematic structure of a matrix operator according to an embodiment of the present invention.
FIG. 5 shows a detailed configuration of an operation processing lane of FIG. 4.
FIG. 6 shows a detailed configuration of a processing element of FIG. 5.
FIG. 7 shows a matrix multiplication algorithm according to an embodiment of the present invention.
FIG. 8 shows a matrix operation method according to an embodiment of the present invention.

In order to fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the contents described in those drawings.

Hereinafter, the present invention will be described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described. Parts irrelevant to the description are omitted in order to describe the present invention clearly, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part is said to "include" a certain component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", "module", and "block" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

FIG. 1 shows a schematic structure of an example of an artificial neural network.

FIG. 1 illustrates a convolutional neural network (hereinafter CNN) as a representative example of an artificial neural network. Specifically, it shows the general structure of LeNet-5, a representative convolutional neural network used for optical character recognition (OCR) and developed for recognizing postal codes on mail and for recognizing digits.

LeNet-5 receives, for example, an input image (Input) of size 32 × 32, repeatedly performs convolution and sub-sampling operations to extract feature maps (f.map), and is configured to select, based on the feature values extracted from the feature maps, the value corresponding to the most probable class among predetermined classes.

Since LeNet-5 is a neural network developed to recognize digits, it may, for example, classify the feature values into the digits 0 to 9 and select one of the digits 0 to 9 as the result.

LeNet-5 is described below as an example, but other convolutional neural networks and artificial neural networks are fundamentally similar, and the concept of the present invention is not limited to LeNet-5 and can be applied to various artificial neural networks.

In FIG. 1, C denotes a convolution layer, S denotes a sub-sampling layer, and FC denotes a fully connected layer, and the number following C, S, or FC is the layer index. That is, LeNet-5 includes three convolution layers (C1, C2, C3), two sub-sampling layers (S1, S2), and two fully connected layers (FC1, FC2).

In a CNN such as the one shown in FIG. 1, each of the convolution layers (C1, C2, C3) receives a feature map (f.map) (or the input image (Input)) and performs a convolution operation between the received feature map (or input image) and the kernel predetermined for that layer. A convolution operation consists of a large number of multiplication and addition operations. Multiplication operations are also performed in the fully connected layers (FC1, FC2).

That is, the CNN shown in FIG. 1 must perform a large number of multiplication and addition operations. Artificial neural networks other than CNNs are likewise fundamentally built on multiplication and addition operations.

In general, artificial neural networks improve operation speed by converting the feature map (f.map) (or the input image (Input)) and the kernel into matrices according to a predetermined algorithm such as im2col (image blocks into columns) and then performing matrix multiplication and matrix addition operations.
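As a rough illustration of the im2col idea mentioned above, the following sketch unfolds every kernel-sized patch of a small 2-D input into one row of a matrix, so that the convolution reduces to a matrix-vector product. It assumes stride 1, no padding, a single channel, and the CNN convention of no kernel flip; the function names are hypothetical, and real implementations often store patches as columns rather than rows.

```python
def im2col(image, kh, kw):
    """Unfold each kh x kw patch of a 2-D image into one row (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    rows = []
    for r in range(h - kh + 1):
        for c in range(w - kw + 1):
            rows.append([image[r + i][c + j] for i in range(kh) for j in range(kw)])
    return rows


def conv2d_as_matmul(image, kernel):
    """Convolve by multiplying the patch matrix with the flattened kernel."""
    kh, kw = len(kernel), len(kernel[0])
    patches = im2col(image, kh, kw)                    # num_patches x (kh*kw)
    flat_kernel = [v for row in kernel for v in row]   # length kh*kw
    return [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = conv2d_as_matmul(image, kernel)  # one value per output position, row-major
```

With several kernels, the flattened kernels would form the columns of a second matrix, which turns the whole layer into one matrix multiplication of the kind discussed in the rest of the description.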

As is well known, a matrix multiplication obtains its result by multiplying designated elements of the two matrices and then adding up all the products. The element-wise multiplications can be performed in parallel as a single multiplication step, whereas adding up the products requires repeated additions, the number of which depends on how many values must be added.

FIG. 2 is a diagram showing a general matrix multiplication method, and FIG. 3 shows the number of multiplication and addition operations required by the matrix multiplication method of FIG. 2.

As shown in FIG. 2, in matrix multiplication the elements of each row (#1, #2, #3, ..., #m) of an m × k matrix A are multiplied by the corresponding elements of each column (&1, &2, &3, ..., &n) of a k × n matrix B, and the products are added together to obtain one element (#, &) of matrix C, which is the product of matrix A and matrix B.

Here, matrix A is called the multiplicand matrix and matrix B is called the multiplier matrix.
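The conventional row-times-column rule described for FIG. 2 can be sketched as follows. This is a minimal pure-Python illustration with the hypothetical name `matmul_inner_product`; indices are zero-based here, unlike the #1 and &1 labels in the figure.

```python
def matmul_inner_product(A, B):
    """Conventional matrix product: each C element is a row of A times a column of B."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for r in range(m):            # row of the multiplicand matrix A
        for c in range(n):        # column of the multiplier matrix B
            products = [A[r][i] * B[i][c] for i in range(k)]  # k element-wise products
            C[r][c] = sum(products)                           # then add them all up
    return C


A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
C = matmul_inner_product(A, B)
```

Note that every output element needs its own summation over k products, which is the step the following paragraphs identify as the bottleneck.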

Referring to FIG. 3, consider the operation for obtaining the element (#1, &1) in the first row and first column of matrix C. According to the general rule of matrix multiplication, the multiplicand matrix A is read in row units and the multiplier matrix B is read in column units; the row read from matrix A and the column read from matrix B are multiplied element by element, and all the products are added.

For example, the elements of the first row (#1) of matrix A are first multiplied by the respective elements of the first column (&1) of matrix B. Here k is assumed to be 16 as an example. If the matrix operator has 16 processing elements (hereinafter PEs), the 16 elements of the first row (#1) of matrix A and the 16 elements of the first column (&1) of matrix B can be multiplied simultaneously in parallel. That is, multiplying the elements of the first row (#1) of matrix A by the elements of the first column (&1) of matrix B takes only one parallel operation.

However, the sum of the 16 products obtained from these element-wise multiplications cannot be computed in a single operation. In general, an operation element such as a processing element (PE) is configured to receive two inputs and perform a multiplication or an addition. Therefore, as shown in FIG. 3, the 16 products must first be added two at a time, and the resulting sums must again be added two at a time, repeatedly. This means that four successive rounds of addition are needed for the 16 products, and consequently that adding the products takes much longer than the element-wise multiplication itself.

Moreover, because it is the products that must be added, the element-wise multiplication and the additions are performed sequentially. That is, the multiplication operation and the addition operations must be performed separately. As shown in FIG. 3, a total of five operations is therefore required for the multiplication of 16 element pairs.

This is the operation for obtaining the value of a single element of matrix C (for example, c_0,0). Assuming that there are m operation processing lanes, each including 16 processing elements (PEs), the operation must be repeated for the whole of matrix C a number of times proportional to the sizes of matrix A and matrix B, so that n × 5 operations are required in total.

Therefore, reducing the time spent on addition operations is very important for accelerating matrix operations.
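The step counts quoted for FIG. 3 can be reproduced with a small model of pairwise (tree) addition. The sketch below assumes, as the text does, that each adder takes two inputs and that all additions in one round happen in parallel; the function names and the example value n = 10 are hypothetical.

```python
def tree_reduce(values):
    """Add values two at a time; return (total, number of sequential addition rounds)."""
    rounds = 0
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # an odd leftover value passes through unchanged
            paired.append(values[-1])
        values = paired
        rounds += 1
    return values[0], rounds


def steps_per_element(k):
    """One parallel multiplication step plus the addition rounds for k products."""
    _, add_rounds = tree_reduce([1] * k)
    return 1 + add_rounds


def total_steps(n, k):
    """Steps for all of C when m lanes each produce one row of C (assumed model)."""
    return n * steps_per_element(k)


total, rounds = tree_reduce(list(range(16)))
```

For k = 16 this gives four addition rounds and five steps per element, and `total_steps(n, 16)` evaluates to n × 5, which matches the figures stated above.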

FIG. 4 shows a schematic structure of a matrix operator according to an embodiment of the present invention, FIG. 5 shows a detailed configuration of an operation processing lane of FIG. 4, and FIG. 6 shows a detailed configuration of a processing element of FIG. 5.

Referring to FIGS. 4 to 6, the matrix operator 100 according to the present embodiment includes an operation control unit 110, a first buffer unit 120, a second buffer unit 130, and an operation unit 140, and may be used as a module of an artificial neural network as described above.

The operation control unit 110 receives a plurality of matrices on which operations are to be performed in each layer of the artificial neural network. These matrices may be at least one feature map (or input image) applied to a layer and at least one kernel designated for each layer.

The operation control unit 110 selects, from among the received matrices, two matrices on which an operation is to be performed, and transfers the two selected matrices, together with an operation command, to the first buffer unit 120 and the second buffer unit 130. Matrix A, the multiplicand matrix for at least one feature map applied to a layer of the artificial neural network, is applied to the first buffer unit 120, and matrix B, the multiplier matrix for the kernel designated for that layer, is applied to the second buffer unit 130.

The operation control unit 110 may also receive the result of the matrix operation from the operation unit 140 and transfer it to a memory (not shown) or the like for storage.

The operation control unit 110 applies the selected matrices together with the operation command so that the elements of each matrix can be selected, and the matrix multiplication performed, on the basis of the matrix multiplication algorithm of the present embodiment described later.

In accordance with the operation command, the first buffer unit 120 selects elements in column units from matrix A applied from the operation control unit 110 and transfers them to the operation unit 140. Likewise, the second buffer unit 130 selects elements in row units from the applied matrix B in accordance with the operation command and transfers them to the operation unit 140.

The operation unit 140 may include a plurality of operation processing lanes (SIMDL). As shown in FIG. 5, each of the operation processing lanes (SIMDL) may include a plurality of processing elements (PE) and a SIMD unit (SIMDU).

To increase operation efficiency, recent matrix operators generally use the SIMD (Single Instruction Multiple Data) technique so that complex operations can be batch-processed with a single instruction. In the SIMD technique, a plurality of processing elements (PEs) apply the same (or a similar) operation to multiple data items and process them simultaneously; it is used mainly in vector processors.

To maximize instruction efficiency, the SIMD technique stores a number of instruction sets, each capable of processing multiple data items with a single instruction. Each stored instruction set causes the plurality of processing elements (PEs) to operate simultaneously in parallel by exploiting data level parallelism (DLP). That is, the SIMD unit (SIMDU) causes the plurality of processing elements (PEs) to perform the same designated operation in parallel on the elements applied in row or column units from the first and second buffer units 120 and 130.

The SIMD unit (SIMDU) may be implemented as hardware within the operation unit 140, or as software that designates the operations performed by the operation unit 140. In some cases it may also be implemented within the operation control unit 110.

Each of the plurality of processing elements (PEs) receives, from the first and second buffer units 120 and 130, those elements (a, b) of matrix A and matrix B that are to be operated on together, and performs a multiplication or an addition. Each of the processing elements (PEs) may be implemented as a multiply-accumulate (MAC) operator.

Referring to FIG. 6, each of the plurality of processing elements (PEs) may include a multiplier (MUL), an adder (ADD), and an accumulation register (ACC).

The multiplier (MUL) multiplies the element (a) of matrix A applied from the first buffer unit 120 by the element (b) of matrix B applied from the second buffer unit 130 and outputs the product to the adder (ADD). The adder (ADD) receives the output of the multiplier (MUL) and the previously computed accumulated partial sum stored in the accumulation register (ACC), and adds them to update the accumulated partial sum. The accumulation register (ACC) stores the updated accumulated partial sum output from the adder (ADD).
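The behaviour of a single processing element described for FIG. 6 can be modelled in a few lines. This is a behavioural sketch only: the class name `ProcessingElement`, the method name `step`, and the initial register value of zero are assumptions made for illustration, not details of the disclosed circuit.

```python
class ProcessingElement:
    """Behavioural model of one PE: multiplier (MUL), adder (ADD), register (ACC)."""

    def __init__(self):
        self.acc = 0  # accumulation register, assumed to start at zero

    def step(self, a, b):
        product = a * b                # MUL: multiply the two applied elements
        updated = self.acc + product   # ADD: add the product to the stored partial sum
        self.acc = updated             # ACC: store the updated partial sum
        return self.acc


pe = ProcessingElement()
first = pe.step(2, 3)    # nothing accumulated yet, so the product passes through
second = pe.step(4, 5)   # adds the new product to the stored partial sum
```

After k calls to `step`, the register holds the sum of the k products applied to that element, which is one element of the result matrix.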

FIG. 7 shows a matrix multiplication algorithm according to an embodiment of the present invention.

The matrix multiplication algorithm of FIG. 7 is described below with reference to FIGS. 4 to 6.

In general matrix multiplication, as shown in FIG. 2, elements are selected in row units (#1, #2, ..., #m) from the multiplicand matrix A and in column units (&1, &2, ..., &n) from the multiplier matrix B, and the multiplication is performed by multiplying the corresponding selected elements by each other and then adding all the products.

In contrast, in the matrix multiplication method according to the present embodiment shown in FIG. 7, the first buffer 120 selects elements in column units (&1, &2, ..., &k) from matrix A, the multiplicand matrix, and the second buffer 130 selects elements in row units (#1, #2, ..., #k) from matrix B, the multiplier matrix, and the selected elements are transmitted to the operation unit 140.

Each of the plurality of processing lanes (SIMDL) of the operation unit 140 receives corresponding elements from matrix A and matrix B, multiplies them together, and accumulates the products.

In particular, in the present embodiment, each of the processing lanes (SIMDL) multiplies one of the elements of matrix A selected in column units (&1, &2, ..., &k) by each of the plurality of elements of matrix B selected in row units (#1, #2, ..., #k).

For example, when the first column (&1) of matrix A and the first row (#1) of matrix B are selected, the first processing lane among the processing lanes (SIMDL) receives the element a_{0,0} in the first row of the first column (&1) of matrix A and the elements (b_{0,0}, b_{0,1}, ..., b_{0,n}) of the first row (#1) of matrix B, and multiplies a_{0,0} by each of the elements (b_{0,0}, b_{0,1}, ..., b_{0,n}).

At this time, in each of the process elements (PE) of the first processing lane, the multiplier (MUL) receives the element a_{0,0} and the corresponding one of the elements (b_{0,0}, b_{0,1}, ..., b_{0,n}), multiplies them, and passes the product to the adder (ADD). Because there is no previously calculated multiplication result, that is, no accumulated value previously stored in the accumulation register (ACC), the adder (ADD) passes the output of the multiplier (MUL) unchanged to the accumulation register (ACC), where it is stored.

That is, the first processing lane obtains the products of the element a_{0,0} in the first row and first column of matrix A and the elements (b_{0,0}, b_{0,1}, ..., b_{0,n}) of the first row (#1) of matrix B as the element values (c⁰_{0,0}, c⁰_{0,1}, ..., c⁰_{0,n}) of the first row (#1) of the first accumulation matrix (C⁰). The obtained element values (c⁰_{0,0}, c⁰_{0,1}, ..., c⁰_{0,n}) of the first row (#1) are stored in the accumulation registers (ACC) of the respective process elements (PE).

Meanwhile, the second processing lane receives the element a_{1,0} in the second row of the first column (&1) of matrix A and the elements (b_{0,0}, b_{0,1}, ..., b_{0,n}) of the first row (#1) of matrix B, multiplies a_{1,0} by each of the elements (b_{0,0}, b_{0,1}, ..., b_{0,n}), and obtains and stores the products as the element values (c⁰_{1,0}, c⁰_{1,1}, ..., c⁰_{1,n}) of the second row (#2) of the first accumulation matrix (C⁰).

In the same manner, the m-th processing lane multiplies the element a_{m,0} in the m-th row of the first column (&1) of matrix A by the elements (b_{0,0}, b_{0,1}, ..., b_{0,n}) of the first row (#1) of matrix B, and obtains and stores the products as the element values (c⁰_{m,0}, c⁰_{m,1}, ..., c⁰_{m,n}) of the m-th row (#m) of the first accumulation matrix (C⁰).

That is, the plurality of processing lanes (SIMDL) of the operation unit 140 perform the multiplication of every element of the first column (&1) of matrix A by every element of the first row (#1) of matrix B simultaneously, in a single operation.
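The first step described above can be sketched in software as follows. The function name `first_step` and the sample values are hypothetical; the nested list comprehension only mimics, sequentially, what the hardware lanes are described as doing in parallel.

```python
def first_step(a_col, b_row):
    """Form the first accumulation matrix from one column of A and one row of B.

    Lane p receives a_col[p]; within each lane, process element q receives b_row[q].
    """
    return [[a * b for b in b_row] for a in a_col]


# Hypothetical data: first column of A is (1, 4), first row of B is (7, 8).
C0 = first_step([1, 4], [7, 8])
print(C0)  # [[7, 8], [28, 32]]
```

Row p of the result is what lane p would hold in its accumulation registers after this step.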

Thereafter, the first buffer 120 transfers the elements (a_{0,1}, a_{1,1}, ..., a_{m,1}) of the second column (&2) of matrix A to the respective corresponding processing lanes among the processing lanes (SIMDL), and the second buffer 130 transfers the elements (b_{1,0}, b_{1,1}, ..., b_{1,n}) of the second row (#2) of matrix B to each of the processing lanes (SIMDL).

Accordingly, in each process element (PE) of each processing lane (SIMDL), the multiplier (MUL) multiplies the applied element of the second column (&2), which is one of (a_{0,1}, a_{1,1}, ..., a_{m,1}), by the corresponding one of the elements (b_{1,0}, b_{1,1}, ..., b_{1,n}) of the second row (#2). In the first processing lane, for example, the adder (ADD) adds the multiplication results output from the multiplier (MUL) to the previously obtained accumulated values (c⁰_{0,0}, c⁰_{0,1}, ..., c⁰_{0,n}) stored in the accumulation register (ACC), obtains the sums as the element values (c¹_{0,0}, c¹_{0,1}, ..., c¹_{0,n}) of the first row (#1) of the second accumulation matrix (C¹), and stores the obtained element values (c¹_{0,0}, c¹_{0,1}, ..., c¹_{0,n}) back in the accumulation register (ACC).

Similarly, the second processing lane receives the element a_{1,1} in the second row of the second column (&2) of matrix A and the elements (b_{1,0}, b_{1,1}, ..., b_{1,n}) of the second row (#2) of matrix B, multiplies a_{1,1} by each of the elements (b_{1,0}, b_{1,1}, ..., b_{1,n}), adds the multiplication results to the accumulated values (c⁰_{1,0}, c⁰_{1,1}, ..., c⁰_{1,n}) stored in the accumulation register (ACC), and obtains and stores the sums as the element values (c¹_{1,0}, c¹_{1,1}, ..., c¹_{1,n}) of the second row (#2) of the second accumulation matrix (C¹).

Likewise, the m-th processing lane multiplies the element a_{m,1} in the m-th row of the second column (&2) of matrix A by the elements (b_{1,0}, b_{1,1}, ..., b_{1,n}) of the second row (#2) of matrix B, adds the accumulated values (c⁰_{m,0}, c⁰_{m,1}, ..., c⁰_{m,n}), and obtains and stores the sums as the element values (c¹_{m,0}, c¹_{m,1}, ..., c¹_{m,n}) of the m-th row (#m) of the second accumulation matrix (C¹).
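The update from the first accumulation matrix to the second can be sketched as below. The function name `accumulate_step` and the numeric values are assumptions made for illustration only.

```python
def accumulate_step(c_prev, a_col, b_row):
    """Add the products of one column of A and one row of B to the stored values.

    c_prev[p][q] plays the role of the value held in the ACC of process element q
    in lane p before this step.
    """
    return [
        [c_prev[p][q] + a_col[p] * b_row[q] for q in range(len(b_row))]
        for p in range(len(a_col))
    ]


# Hypothetical data: C0 from a first step, then second column of A and second row of B.
C0 = [[7, 8], [28, 32]]
C1 = accumulate_step(C0, [2, 5], [9, 10])
print(C1)  # [[25, 28], [73, 82]]
```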

In this way, the operation unit 140 sequentially receives the elements up to the k-th column (&k) of matrix A and the k-th row (#k) of matrix B, multiplies the applied elements, and adds the products to the previously calculated accumulated values, finally obtaining the k-th accumulation matrix (C^k), which is the matrix multiplication result of matrix A and matrix B.
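The accumulation described above can be summarized as a recurrence. Here a_{:,i} denotes the i-th selected column of A and b_{i,:} the i-th selected row of B. The zero-based range i = 0, ..., k-1 and the notation C^{(i)} are assumptions made for this sketch, so the final matrix is written C^{(k-1)} here rather than the C^k used in the text.

```latex
C^{(0)} = a_{:,0}\, b_{0,:}, \qquad
C^{(i)} = C^{(i-1)} + a_{:,i}\, b_{i,:} \quad (1 \le i \le k-1), \qquad
C^{(k-1)} = \sum_{i=0}^{k-1} a_{:,i}\, b_{i,:} = AB .
```

Each term a_{:,i} b_{i,:} is an m × n matrix, which is why every step updates all m × n accumulated values at once.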

Describing the above matrix multiplication algorithm again from the viewpoint of each of the process elements (PE) of the processing lanes (SIMDL): each of the process elements (PE) of the p-th SIMD lane sequentially multiplies the elements (a_{p,0}, a_{p,1}, ..., a_{p,k}) of the corresponding p-th row of matrix A by the elements (b_{0,q}, b_{1,q}, ..., b_{k,q}) of the column q of matrix B that corresponds to that process element, and adds each multiplication result to the accumulated value of the previous multiplication results. The p-th lane as a whole thereby obtains the element values (c^k_{p,0}, c^k_{p,1}, ..., c^k_{p,n}) of the p-th row of the k-th accumulation matrix (C^k).
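The per-element view can be traced with a short sketch that records the partial sums held by one process element. The function name `pe_trace`, the zero-based indices p and q, and the sample matrices are hypothetical.

```python
def pe_trace(A, B, p, q):
    """Return the successive partial sums of the PE at position q in lane p.

    At step i the PE multiplies A[p][i] by B[i][q] and adds the product to
    its accumulated value.
    """
    acc = 0
    trace = []
    for i in range(len(B)):  # len(B) is k, the shared dimension
        acc += A[p][i] * B[i][q]
        trace.append(acc)
    return trace


A = [[1, 2, 3],
     [4, 5, 6]]      # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]       # 3 x 2
print(pe_trace(A, B, 0, 0))  # [7, 25, 58]
```

The last entry of the trace is the ordinary inner product of row p of A and column q of B, which matches the observation in the text that each process element performs the same calculation as for a single element of C.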

The calculation performed by each of these process elements (PE) is the same as the process of calculating one element (#, &) of the C matrix in the general matrix multiplication operation shown in FIG. 2. In the algorithm of FIG. 2, however, the operation unit simultaneously receives the elements of matrix A and matrix B required to calculate one element (#, &) of the C matrix, performs the multiplications, and must then repeatedly perform addition operations on the multiplication results. As shown in FIG. 3, therefore, each time an element (#, &) of the C matrix is obtained, a single multiplication operation has to be followed by multiple addition operations.

In the matrix multiplication algorithm according to the present embodiment shown in FIG. 7, by contrast, sequentially accumulated accumulation matrices (C⁰, C¹, ..., C^k) are obtained, so the C matrix, which is the multiplication result of matrix A and matrix B, can be obtained with only k multiplication operations and k addition operations.

In particular, in each of the process elements (PE) of the processing lanes (SIMDL), while the adder (ADD) is adding the previous multiplication result output from the multiplier (MUL) to the accumulated value stored in the accumulation register (ACC), the multiplier (MUL) can receive the elements a and b to be multiplied next and perform the multiplication. That is, the multiplier (MUL) and the adder (ADD) can operate concurrently according to a pipeline technique.

This means that obtaining the C matrix, which is the multiplication result of matrix A and matrix B, takes an operation time of k+1 rather than 2k.
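The 2k versus k+1 comparison can be reproduced with a simple cycle count. This sketch assumes that one multiplication and one addition each take exactly one time unit; the function names are hypothetical, and real hardware latencies may differ.

```python
def cycles_sequential(k):
    """Multiply, then add, one after the other for each of the k steps."""
    cycles = 0
    for _ in range(k):
        cycles += 1  # multiplication
        cycles += 1  # addition waits for the multiplication to finish
    return cycles


def cycles_pipelined(k):
    """Two-stage pipeline: the adder consumes the previous product while the
    multiplier works on the next element pair in the same cycle."""
    cycle = 0
    issued = 0               # multiplications started
    added = 0                # additions completed
    product_waiting = False  # a product from the previous cycle awaits the adder
    while added < k:
        cycle += 1
        if product_waiting:  # ADD stage
            added += 1
            product_waiting = False
        if issued < k:       # MUL stage, concurrent with ADD
            issued += 1
            product_waiting = True
    return cycle


print(cycles_sequential(8), cycles_pipelined(8))  # 16 9
```

Under these assumptions the pipelined count is always k + 1: k cycles of multiplication plus one final cycle to add the last product.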

That is, the matrix multiplication time can be greatly reduced because the addition operations require almost no additional time.

FIG. 8 shows a matrix operation method according to an embodiment of the present invention.

The matrix operation method of FIG. 8 is described with reference to FIGS. 4 to 7. First, the two matrices to be multiplied are obtained (S10). One of the two matrices is an m × k multiplicand matrix, referred to as matrix A, and the other is a k × n multiplier matrix, referred to as matrix B. Here, matrix A may be all or part of a feature map (f.map) (or input image (Input)) input to each layer of the artificial neural network, and matrix B may be all or part of a kernel predetermined for each layer.

Once the two matrices to be operated on are obtained, the i-th column is selected from matrix A, the multiplicand matrix (S20), and the i-th row is selected from matrix B, the multiplier matrix (S30). The initial value of i is 1, so the first column of matrix A and the first row of matrix B are selected first.

Then each of the elements (a_{0,i}, a_{1,i}, ..., a_{m,i}) of the selected i-th column of matrix A is multiplied by all of the elements (b_{i,0}, b_{i,1}, ..., b_{i,n}) of the selected i-th row of matrix B, yielding m × n multiplication results (S40).

When the m × n multiplication results are obtained, they are added to the previously obtained accumulated values (S50). If there are no previously obtained accumulated values, the multiplication results are taken as the initial accumulated values; if there are, the result of adding the obtained multiplication results to the previously obtained accumulated values is stored as the updated accumulated values (S60).

Then it is determined whether i is less than k, which is the number of columns of matrix A and the number of rows of matrix B (S70). If i is less than k (i < k), i is changed to i+1 (S80), and the column of matrix A and the row of matrix B following the previously selected ones are selected (S20).

If, however, i is greater than or equal to k, the accumulation matrix composed of the stored m × n accumulated values is output as the C matrix, which is the matrix multiplication result of matrix A and matrix B (S90).
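Steps S10 to S90 can be followed in a short software sketch. The function name `matrix_operation` is hypothetical, the matrices are plain nested lists, and the loops run sequentially where the described hardware would work in parallel; k is assumed to be at least 1.

```python
def matrix_operation(A, B):
    """Follow the flow S10..S90 for an m x k matrix A and a k x n matrix B."""
    # S10: obtain the two matrices and their sizes.
    m, k, n = len(A), len(B), len(B[0])
    acc = None  # no accumulated values exist before the first pass
    i = 1       # initial value of i (1-based, as in the text)
    while True:
        a_col = [A[p][i - 1] for p in range(m)]              # S20: i-th column of A
        b_row = B[i - 1]                                     # S30: i-th row of B
        products = [[a * b for b in b_row] for a in a_col]   # S40: m x n products
        if acc is None:
            acc = products  # S50/S60: products become the initial accumulated values
        else:
            # S50/S60: add the products to the stored values and store the result.
            acc = [[acc[p][q] + products[p][q] for q in range(n)]
                   for p in range(m)]
        if i < k:       # S70
            i += 1      # S80, then back to S20
        else:
            return acc  # S90: output the accumulation matrix as C


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]
print(matrix_operation(A, B))  # [[58, 64], [139, 154]]
```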

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disk)-ROM, DVD (digital video disk)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible therefrom.

Therefore, the true technical protection scope of the present invention should be determined by the technical spirit of the appended claims.

110: operation control unit 120: first buffer unit
130: second buffer unit 140: operation unit
SIMDL: processing lane PE: process element
SIMDU: SIMD unit MUL: multiplier
ADD: adder ACC: accumulation register

Claims

A matrix operator comprising:
a first buffer that receives and stores a first matrix, which is a multiplicand matrix;
a second buffer that receives and stores a second matrix, which is a multiplier matrix by which the first matrix is multiplied; and
an operation unit that receives a plurality of elements sequentially selected in column units from the first matrix, receives a plurality of elements sequentially selected in row units from the second matrix in correspondence with the column selected from the first matrix, multiplies each element of the column selected from the first matrix by all the elements of the row selected from the second matrix, and cumulatively adds the multiplication results between the sequentially selected columns of the first matrix and rows of the second matrix, to obtain a result matrix that is the matrix multiplication result of the first matrix and the second matrix.

The matrix operator of claim 1, wherein the operation unit
includes a plurality of processing lanes, each of which receives a corresponding one of the elements of the column selected from the first matrix and all the elements of the row selected from the second matrix, multiplies them, and adds the multiplication results to the accumulated values of previous multiplication results to obtain the elements of a row of an accumulation matrix.

The matrix operator of claim 1, wherein each of the plurality of processing lanes,
while adding the inter-element multiplication results to the accumulated values of previous multiplication results, performs a multiplication operation on the elements of the column of the first matrix and all the elements of the row of the second matrix that are selected next according to a predetermined sequence.

The matrix operator of claim 2, wherein each of the plurality of processing lanes
includes a plurality of process elements, and
each of the plurality of process elements includes:
a multiplier that receives and multiplies a corresponding element of the column selected from the first matrix and a corresponding one of the plurality of elements of the row selected from the second matrix;
an adder that updates an accumulated value by adding the multiplication result output from the multiplier to the accumulated value obtained by accumulating the multiplication results of previously applied elements; and
an accumulation register that stores the accumulated value updated by the adder.

The matrix operator of claim 1, wherein the second buffer
selects the i-th row of the second matrix when the i-th column (where i is a natural number) of the first matrix is selected in the first buffer.

The matrix operator of claim 1, wherein the matrix operator
is implemented as an artificial neural network module that performs a specified operation in at least one layer among a plurality of layers of an artificial neural network, and
the first matrix is a feature map applied to the at least one layer, and the second matrix is a kernel predetermined for the at least one layer.

A matrix operation method comprising:
receiving and storing a first matrix, which is a multiplicand matrix, and a second matrix, which is a multiplier matrix by which the first matrix is multiplied;
receiving a plurality of elements sequentially selected in column units from the first matrix and a plurality of elements sequentially selected in row units from the second matrix in correspondence with the column selected from the first matrix;
multiplying each element of the column selected from the first matrix by all the elements of the row selected from the second matrix; and
obtaining a result matrix that is the matrix multiplication result of the first matrix and the second matrix by cumulatively adding the multiplication results between the sequentially selected columns of the first matrix and rows of the second matrix.

The method of claim 7, wherein obtaining the result matrix
includes obtaining an element of a partial accumulation matrix by adding an inter-element multiplication result obtained in the step of multiplying by all the elements to an accumulated value obtained by accumulating the previously obtained corresponding inter-element multiplication results.

The method of claim 7, wherein each of a plurality of processing lanes,
while adding the inter-element multiplication results to the accumulated values of previous multiplication results, performs a multiplication operation on the elements of the column of the first matrix and all the elements of the row of the second matrix that are selected next according to a predetermined sequence.

The method of claim 7, wherein the step of receiving the selected plurality of elements
includes receiving the plurality of elements of the i-th row of the second matrix when the i-th column (where i is a natural number) of the first matrix is selected.