KR20200048750A - Machine learning accelerator and matrix operating method thereof

Info

Publication number: KR20200048750A
Application number: KR1020180131185A
Authority: KR
Inventors: 한태희; 이승찬
Original assignee: 성균관대학교산학협력단
Priority date: 2018-10-30
Filing date: 2018-10-30
Publication date: 2020-05-08
Also published as: KR102191428B1

Abstract

The present invention relates to a machine learning accelerator and a matrix operation method thereof for accelerating matrix operations. According to an embodiment of the present invention, the matrix operation method of a machine learning accelerator including a matrix operation unit consisting of N x N operators comprises: (a) setting, as a data transmission boundary, a region where data transmitted in a first direction, a second direction, a third direction, and a fourth direction of the matrix operation unit intersect, dividing the unit into a first zone and a second zone with respect to the data transmission boundary, and restricting data transmission between the first zone and the second zone; (b) receiving, by the matrix operation unit, input data from at least two of the first, second, third, and fourth directions; and (c) performing, by the N x N operators, operations on the input data at each cycle, and summing, by the matrix operation unit, the operation results of the operators and outputting the sum in a column direction. Operators positioned on the data transmission boundary perform both an operation using data transmitted from the first zone and an operation using data transmitted from the second zone.

Description

MACHINE LEARNING ACCELERATOR AND MATRIX OPERATING METHOD THEREOF

The present invention relates to a machine learning accelerator, and a matrix operation method thereof, in which matrix operations can be accelerated by feeding data into a matrix operation unit from four directions and performing the arithmetic processing.

Owing to recent active research on artificial neural network algorithms and machine learning, accuracy in specific fields such as image recognition and natural language processing is approaching human level, and major results are also expected in fields such as autonomous vehicles and automation systems. Research on improving algorithm accuracy is the most active, and research on hardware accelerators that implement such algorithms quickly and efficiently is also actively under way.

As machine learning enters everyday use, an enormous amount of computation arises in training artificial intelligence and in performing inference with the trained model. Because conventional computer architectures are limited in their ability to process such a volume of operations, hardware units optimized for matrix operations, such as the TPU (Tensor Processing Unit) developed by Google, have appeared.

Deep learning consists of a training process, in which a neural network learns from input data, and an inference process, in which the trained neural network performs tasks such as data recognition. The TPU is an accelerator dedicated to deep learning inference and is designed as a coprocessor in PCIe card form, so its control instructions and data are delivered from the host over the PCIe interface.

FIG. 1 is a diagram illustrating the structure of a general systolic array-based matrix operator, and FIG. 2 is a diagram illustrating the critical path of the matrix operator of FIG. 1.

As shown in FIG. 1, the TPU processes input data and parameters through an N x N matrix operator whose operators are arranged in a systolic array structure. Each operator receives data from the west or the north, performs multiplication and addition on the received data, and outputs the operation result in the column direction.

To this end, each operator consists of one multiplier, one adder, a buffer, and a forwarding unit. As shown in FIG. 2, when an N x N matrix operation is performed, the input data can be arranged to span 2N-1 cycles, so the critical path is (2N-1) + (N-1) = 3N-2 cycles; one matrix operation therefore takes 3N-2 cycles.
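The 3N-2 figure can be checked with a small timing model. The sketch below is an illustration only: it assumes an output-stationary arrangement in which row i of A enters from the west delayed by i cycles and column j of B enters from the north delayed by j cycles, so operator (i, j) sees the pair a[i][k], b[k][j] at cycle i + j + k. The function name `simulate_conventional` and this dataflow are assumptions and are not taken from the patent's figures.

```python
def simulate_conventional(a, b):
    """Timing model of a two-direction systolic array (assumed dataflow).

    Returns the product matrix and the number of cycles elapsed when the
    last multiply-accumulate has been performed.
    """
    n = len(a)
    acc = [[0] * n for _ in range(n)]
    total_cycles = 3 * n - 2          # (2N-1) input cycles + (N-1) propagation
    last_active = -1
    for t in range(total_cycles):
        for i in range(n):
            for j in range(n):
                k = t - i - j         # index of the operand pair arriving now
                if 0 <= k < n:
                    acc[i][j] += a[i][k] * b[k][j]
                    last_active = t
    return acc, last_active + 1


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    product, cycles = simulate_conventional(a, b)
    print(product, cycles)  # [[19, 22], [43, 50]] 4
```

In this model the operator in the last row and last column receives its final operand pair at cycle 3N-3, which is why the whole operation occupies 3N-2 cycles.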

FIG. 3 is a diagram illustrating a general N x N matrix operation, and FIG. 4 is a diagram illustrating the operation scheme of the matrix operator of FIG. 1.

Referring to FIGS. 3 and 4, when performing an operation on matrices of width N and height N, a general systolic array-based matrix operator receives data from the west and from the north, and each operator computes on the data it receives.

Because the output value of each operator is passed as input to the adjacent operator in the next row, a systolic array-based matrix operator accesses memory less often than a general computation method that must store every output value in memory, which is advantageous for reducing power consumption.

However, because data enters a systolic array-based matrix operator from only one or two directions, the operation cannot finish until the last input data reaches the last part of the matrix where the result is stored. Consequently, the arrival time grows as the matrix becomes larger.

To solve the problem described above, an object of the present invention is to accelerate matrix operations by providing, according to an embodiment, a matrix operation unit based on a modified systolic array that receives data from four directions of the matrix operation unit and performs the arithmetic processing.

However, the technical problem to be solved by the present embodiment is not limited to the one described above, and other technical problems may exist.

As a technical means for achieving the above object, a matrix operation method of a machine learning accelerator according to an embodiment of the present invention is a matrix operation method of a machine learning accelerator including a matrix operation unit consisting of N x N operators, the method comprising: a) setting, as a data transmission boundary, a region where the data transmitted in a first direction, a second direction, a third direction, and a fourth direction of the matrix operation unit intersect, dividing the matrix operation unit into a first zone and a second zone with respect to the data transmission boundary, and then restricting data transmission between the first zone and the second zone; b) receiving, by the matrix operation unit, input data from at least two of the first, second, third, and fourth directions; and c) performing, by the N x N operators, operations on the input data at every cycle, and summing, by the matrix operation unit, the operation results of the operators and outputting the sum in a column direction, wherein the operators located on the data transmission boundary perform both an operation using data transmitted from the first zone and an operation using data transmitted from the second zone.

A machine learning accelerator according to an embodiment of the present invention is a machine learning accelerator that performs N x N matrix operations and includes a matrix operation unit consisting of N x N operators that receive input data from at least two of a first direction, a second direction, a third direction, and a fourth direction and perform a matrix operation using the input data. The matrix operation unit sets, as a data transmission boundary, a region where the data transmitted in the first, second, third, and fourth directions intersect, is divided into a first zone and a second zone with respect to the data transmission boundary, and restricts data transmission between the first zone and the second zone. The operators of the matrix operation unit that are located on the data transmission boundary perform both an operation using data transmitted from the first zone and an operation using data transmitted from the second zone.

According to the above means, a matrix operation unit based on a modified systolic array is provided that accepts data input from four directions and restricts data transmission with respect to the data transmission boundary, thereby achieving higher computational throughput and performance than a conventional matrix operator that receives data from two directions.

The present invention adds no unnecessary logic inside the matrix operation unit; the only hardware change is that additional operation logic is placed in the cells located on the data transmission boundary. The overhead in chip area and power consumption is therefore almost negligible, while the input data can be arranged for input from four directions so that matrix operations are accelerated.

FIG. 1 is a diagram illustrating the structure of a general systolic array-based matrix operator.
FIG. 2 is a diagram illustrating the critical path of the matrix operator of FIG. 1.
FIG. 3 is a diagram illustrating a general N x N matrix operation.
FIG. 4 is a diagram illustrating the operation scheme of the matrix operator of FIG. 1.
FIG. 5 is a diagram showing the structure of N x N operators for implementing a matrix operation method of a machine learning accelerator according to an embodiment of the present invention.
FIG. 6 is a block diagram illustrating the configuration of an operator, which is a component of FIG. 5.
FIG. 7 is an operation flowchart showing a matrix operation method of a machine learning accelerator according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating the critical path of a matrix operation unit according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating the operation scheme of a matrix operation unit according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. Parts unrelated to the description are omitted from the drawings for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" with another element in between. When a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it should be understood that the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof is not precluded.

The following embodiments are detailed descriptions intended to aid understanding of the present invention and do not limit its scope. Accordingly, inventions of the same scope that perform the same function as the present invention also fall within the scope of the present invention.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 5 is a diagram showing the structure of a matrix operator for implementing a matrix operation method of a machine learning accelerator according to an embodiment of the present invention, and FIG. 6 is a block diagram illustrating the configuration of an operator, which is a component of FIG. 5.

Referring to FIG. 5, the matrix operation unit 100 of the machine learning accelerator proposed by the present invention includes N x N operators that can receive data from two directions or from four directions.

Here, data input from the first direction (west) is forwarded toward the third direction at every cycle, data input from the second direction (north) is forwarded toward the fourth direction at every cycle, data input from the third direction (east) is forwarded toward the first direction at every cycle, and data input from the fourth direction (south) is forwarded toward the second direction at every cycle. That is, data input from each direction is forwarded in the direction opposite to its input direction.
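This forwarding rule amounts to a fixed mapping from entry side to travel direction. A minimal sketch, assuming row indices grow southward and column indices grow eastward; the names `TRAVEL` and `next_cell` are illustrative only and do not come from the patent:

```python
# Side from which data enters -> (row step, column step) it then travels.
# Assumed grid convention: row 0 is the north edge, column 0 is the west edge.
TRAVEL = {
    "west": (0, 1),    # first direction: moves toward the east
    "north": (1, 0),   # second direction: moves toward the south
    "east": (0, -1),   # third direction: moves toward the west
    "south": (-1, 0),  # fourth direction: moves toward the north
}


def next_cell(row, col, entered_from):
    """Return the operator that receives the data on the following cycle."""
    d_row, d_col = TRAVEL[entered_from]
    return row + d_row, col + d_col


print(next_cell(0, 0, "west"))   # (0, 1)
print(next_cell(3, 3, "south"))  # (2, 3)
```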

Meanwhile, the machine learning accelerator sets, as the data transmission boundary, the region where the data transmitted in the first, second, third, and fourth directions of the matrix operation unit intersect. The N x N operators are divided into a first zone and a second zone with respect to the data transmission boundary, and data transmission between the first zone and the second zone is restricted. The first zone may be the northwest zone and the second zone the southeast zone.

Accordingly, the operators in the first zone perform an operation on the input data through their operation logic and then forward the data toward the south (fourth direction) or the east (third direction), and the operators in the second zone perform an operation on the input data through their operation logic and then forward the data toward the north (second direction) or the west (first direction).
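The zone split and the per-zone forwarding directions can be expressed as a small lookup. This sketch assumes the data transmission boundary is the anti-diagonal of the array (cells with row + col = N - 1), which separates a northwest zone from a southeast zone. The text states that the boundary may vary with the input positions, so that choice, along with the names `zone_of`, `FORWARD`, and `layout`, is an assumption made for illustration.

```python
def zone_of(row, col, n):
    """Classify an operator of an n x n array (assumed anti-diagonal boundary)."""
    s = row + col
    if s < n - 1:
        return "first"      # northwest zone
    if s > n - 1:
        return "second"     # southeast zone
    return "boundary"


# Directions in which each zone forwards data after computing on it.
FORWARD = {
    "first": ("south", "east"),
    "second": ("north", "west"),
}


def layout(n):
    """Return a text picture of the zones: 1, 2, or B for boundary."""
    symbol = {"first": "1", "second": "2", "boundary": "B"}
    return ["".join(symbol[zone_of(r, c, n)] for c in range(n)) for r in range(n)]


for line in layout(4):
    print(line)
# 111B
# 11B2
# 1B22
# B222
```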

Because the operators located on the data transmission boundary are where data transmitted from the first zone and data transmitted from the second zone meet, two sets of operation logic are placed in each of them so that the operation using data from the first zone and the operation using data from the second zone can each be performed.

That is, as shown in (a) and (b) of FIG. 6, each of the operators 110 located in the first zone and the second zone consists of a multiplier 111, an adder 112, a buffer 113, and a plurality of units 114 including a forwarding unit.

However, as shown in (c) of FIG. 6, each of the operators 120 located on the data transmission boundary consists of a first multiplier 121 and a first adder 122 for operating on data transmitted from the first zone, a second multiplier 124 and a second adder 125 for operating on data transmitted from the second zone, a buffer 123, and other units 126 including a forwarding unit.
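One way to picture the difference between the two operator types of FIG. 6 is as two small classes. This is a behavioral sketch only: the class names, the `step` method, and the choice to accumulate both products into the single shared buffer are assumptions made for illustration, not details confirmed by the text.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    """Zone operator: one multiplier, one adder, one buffer."""
    buffer: int = 0

    def step(self, a, b):
        self.buffer += a * b            # one multiply-add per cycle
        return self.buffer


@dataclass
class BoundaryOperator:
    """Boundary operator: two multiplier/adder pairs sharing one buffer."""
    buffer: int = 0

    def step(self, a1, b1, a2, b2):
        first = a1 * b1                 # data arriving from the first zone
        second = a2 * b2                # data arriving from the second zone
        self.buffer += first + second   # two multiply-adds in the same cycle
        return self.buffer


pe = Operator()
pe.step(2, 3)
edge = BoundaryOperator()
edge.step(2, 3, 4, 5)
print(pe.buffer, edge.buffer)  # 6 26
```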

FIG. 7 is an operation flowchart showing a matrix operation method of a machine learning accelerator according to an embodiment of the present invention, FIG. 8 is a diagram illustrating the critical path of a matrix operation unit according to an embodiment of the present invention, and FIG. 9 is a diagram illustrating the operation scheme of a matrix operation unit according to an embodiment of the present invention.

Referring to FIGS. 7 to 9, in the matrix operation method of the machine learning accelerator, the region where the data transmitted from the four directions of the matrix operation unit intersect is set as the data transmission boundary, and data transmission between the first zone and the second zone is restricted with respect to that boundary (S1).

The machine learning accelerator feeds input data into the matrix operation unit consisting of N x N operators from two or four of the first, second, third, and fourth directions (S2).

Each operator performs an operation on the input data at every cycle (S3) and forwards the data for its operation result (the output value) as the input value of the operator in the next row or column (S4).

Finally, the matrix operation unit outputs each operation result in the column direction. The operation results accumulated by the summing unit are stored after passing through activation, normalization, and pooling, and the stored values are either reused as input to the matrix operation unit or output to main memory through the interface (S5).

The operators located on the data transmission boundary perform both the operation using data transmitted from the first zone and the operation using data transmitted from the second zone, and the data transmission direction is restricted with respect to the boundary so that data of the first zone is not forwarded into the second zone and data of the second zone is not forwarded into the first zone.

As shown in FIGS. 8 and 9, unlike a matrix operator based on a conventional systolic array, the matrix operation unit based on the modified systolic array receives input data not only from the first direction (west) and the second direction (north) but also from the third direction (east) and the fourth direction (south). With respect to the data transmission boundary, data in the first zone (northwest zone) is forwarded toward the south or the east, and data in the second zone (southeast zone) is forwarded toward the north or the west.

As shown in FIG. 9, when the data a₀₀ and b₀₀ are input to the upper-right operator of the first direction (west) and the lower-left operator of the second direction (north), respectively, the data transmission boundary is set as shown in FIG. 5, and, so that no operator sits idle, the data a_(N-1)(N-1) and b_(N-1)(N-1) are input to the lower-left operator of the third direction (east) and the upper-right operator of the fourth direction (south), respectively. The data transmission boundary may vary with the data input positions, and when data is input from four directions of the matrix operation unit, data input is performed symmetrically at the operators located at both ends with respect to the data transmission boundary.

At each cycle, every operator performs an operation such as multiplication or addition on the data it receives, whereas the operators located on the data transmission boundary perform two operations per cycle.

When performing an N x N matrix operation, a matrix operator based on a general systolic array can arrange input data spanning 2N-1 cycles, whereas the matrix operation unit of the present invention can arrange input data spanning 2N-2 cycles. Comparing the critical paths, a matrix operator based on a general systolic array takes 3N-2 cycles for one matrix operation, whereas the matrix operation unit of the present invention takes 2N-2 cycles, a difference of N cycles per matrix operation.
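The cycle counts quoted here can be tabulated directly. The snippet only evaluates the two formulas stated in the text; the function names and the sample sizes are illustrative assumptions.

```python
def conventional_cycles(n):
    """Two-direction systolic array: (2N-1) input cycles + (N-1) propagation."""
    return (2 * n - 1) + (n - 1)      # = 3N - 2


def proposed_cycles(n):
    """Four-direction unit, using the 2N-2 figure stated in the text."""
    return 2 * n - 2


for n in (4, 16, 256):
    saved = conventional_cycles(n) - proposed_cycles(n)
    print(n, conventional_cycles(n), proposed_cycles(n), saved)
# 4 10 6 4
# 16 46 30 16
# 256 766 510 256
```

The saving equals N for every size, so the relative reduction, N / (3N-2), approaches one third as N grows.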

In the machine learning accelerator according to the present invention, no additional logic is needed inside an operator to decide whether incoming data should be forwarded to the next cell (that is, the next operator), so no unnecessary logic is added. Operation logic is added only to the cells located on the data transmission boundary, so this simple change to the hardware structure incurs almost no overhead in chip area or power consumption.

The matrix operation method of the machine learning accelerator according to the embodiment of the present invention described above may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. Such a recording medium includes computer-readable media, which may be any available media that can be accessed by a computer and include volatile and nonvolatile media as well as removable and non-removable media. Computer-readable media also include computer storage media, which include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is given by way of example, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: matrix operation unit

Claims

A matrix operation method of a machine learning accelerator including a matrix operation unit consisting of N x N operators, the method comprising:
a) setting, as a data transmission boundary, a region where the data transmitted in a first direction, a second direction, a third direction, and a fourth direction of the matrix operation unit intersect, dividing the matrix operation unit into a first zone and a second zone with respect to the data transmission boundary, and then restricting data transmission between the first zone and the second zone;
b) receiving, by the matrix operation unit, input data from at least two of the first direction, the second direction, the third direction, and the fourth direction; and
c) performing, by the N x N operators, operations on the input data at every cycle, and summing, by the matrix operation unit, the operation results of the operators and outputting the sum in a column direction,
wherein the operators located on the data transmission boundary perform both an operation using data transmitted from the first zone and an operation using data transmitted from the second zone.

The matrix operation method of claim 1,
wherein, in step c),
after each operator performs an operation on the input data, the output value corresponding to the operation result is forwarded as input data to the operator in the next row or column, and
the data is forwarded in the direction opposite to the direction from which the input data was input until it reaches an operator located on the data transmission boundary.

The matrix operation method of claim 1,
wherein each operator located in the first zone or the second zone includes one set of operation logic comprising a multiplier and an adder, and
each operator located on the data transmission boundary includes two sets of operation logic.

The matrix operation method of claim 1,
wherein the matrix operation unit performs the matrix operation by arranging input data spanning 2N-2 cycles.

A machine learning accelerator that performs N x N matrix operations, comprising:
a matrix operation unit consisting of N x N operators that receive input data from at least two of a first direction, a second direction, a third direction, and a fourth direction and perform a matrix operation using the input data,
wherein the matrix operation unit sets, as a data transmission boundary, a region where the data transmitted in the first direction, the second direction, the third direction, and the fourth direction intersect, is divided into a first zone and a second zone with respect to the data transmission boundary, and restricts data transmission between the first zone and the second zone, and
the operators of the matrix operation unit that are located on the data transmission boundary perform both an operation using data transmitted from the first zone and an operation using data transmitted from the second zone.

The machine learning accelerator of claim 5,
wherein each operator located in the first zone or the second zone includes a multiplier, an adder, a buffer, and a forwarding unit, and
each operator located on the data transmission boundary includes a plurality of multipliers, a plurality of adders, a buffer, and a forwarding unit.