KR20200062743A

KR20200062743A - Method and reconfigurable interconnect topology for multi-dimensional parallel training of convolutional neural network

Info

Publication number: KR20200062743A
Application number: KR1020180148497A
Authority: KR
Inventors: 김동준; 홍병철
Original assignee: 한국과학기술원
Priority date: 2018-11-27
Filing date: 2018-11-27
Publication date: 2020-06-04
Also published as: KR102163209B1

Abstract

Disclosed are a method for multi-dimensional parallel training of convolutional neural networks and a reconfigurable interconnect topology between the apparatuses that perform it. According to an embodiment of the present invention, the method for multi-dimensional parallel training of a Winograd-transformed convolution comprises the steps of: converting input data including an input feature map into a plurality of tiles; distributing the converted tiles to a plurality of workers arranged in two dimensions by a plurality of clusters and a plurality of groups, wherein the converted tiles are distributed cluster by cluster; performing data parallelism on the distributed input data through the workers of each of the clusters; and performing intra-tile parallelism on the element units of the tiles applied to the respective groups, among the input data distributed through the workers of each of the clusters.

Description

METHOD AND RECONFIGURABLE INTERCONNECT TOPOLOGY FOR MULTI-DIMENSIONAL PARALLEL TRAINING OF CONVOLUTIONAL NEURAL NETWORK

The following embodiments relate to a method for multi-dimensional parallelization of convolutional neural network training and to a reconfigurable interconnect structure between the devices that perform it.

Accelerating neural network training is critical for exploring the design space of neural networks. Data-parallel training, in which an input batch is distributed across multiple workers, is commonly used to accelerate the training of convolutional neural networks (CNNs). For example, Korean Patent Laid-Open Publication No. 10-2016-0144467 discloses a technique for parallelizing the training of convolutional neural networks. However, when the batch size is limited, the communication time of the weight gradients can limit the scalability of parallel training.

Provided are a multi-dimensional parallel training method, and a computer device that performs it, offering a new multi-dimensional parallel training (MPT) of Winograd-domain convolution that exploits both conventional data-parallel training and the intra-tile (or element-wise) parallelism available in the Winograd domain.

Provided are a multi-dimensional parallel training method, and a computer device that performs it, offering a 2D organization of workers with a hybrid topology that uses a ring topology for collective communication in data parallelism and a high-connectivity topology for tile transfer, in order to efficiently support communication between workers under MPT.

Provided are a multi-dimensional parallel training method, and a computer device that performs it, offering dynamic clustering that allows the workers in the MPT architecture to optimize the balance between the two types of communication (weight gradients and tile transfer).

Provided are a multi-dimensional parallel training method, and a computer device that performs it, offering prediction of the activation of spatial-domain neurons from quantized values of Winograd-domain data, without loss of accuracy, in order to reduce the bandwidth required for tile-gathering communication.

Provided is a multi-dimensional parallel training method for Winograd-transformed convolution, the method comprising: converting input data including an input feature map into a plurality of tiles; distributing the converted tiles to a plurality of workers arranged in two dimensions by a plurality of clusters and a plurality of groups, wherein the converted tiles are distributed cluster by cluster; performing, through the workers of each of the clusters, data parallelism on the distributed input data; and performing, through the workers of each of the groups, intra-tile parallelism on the element units of the tiles applied to each of the groups among the distributed input data.

According to one aspect, the performing of the intra-tile parallelism may comprise performing the intra-tile parallelism through element-wise dot products, within the Winograd-domain dot product, over the elements located at the same position in each of the tiles.

According to another aspect, the workers included in each of the groups may be interconnected through a ring topology that supports collective operations on the weights, and the workers included in each of the clusters may be interconnected through a high-connectivity topology (for example, a 2D flattened butterfly (FBFLY) topology or a Dragonfly topology) that efficiently supports all-to-all traffic for tile gather/scatter.

According to yet another aspect, the multi-dimensional parallel training method may further comprise dynamically changing the number of clusters and the number of groups, through a host that connects the groups to one another, according to at least one of the size of the input feature map and the size of the weights.

According to yet another aspect, a source worker among the workers may transmit quantized values of tile elements, for prediction, to a destination worker before transmitting the actual values, and the destination worker may predict the activation of spatial-domain neurons from the quantized values and skip unnecessary tile gathering.

According to yet another aspect, each of the workers may be implemented as a scalable near-data processing (NDP) architecture including a control unit, a computation unit, and a communication unit.

Provided is a computer program stored in a computer-readable recording medium, which is combined with a computer device to cause the computer device to execute the multi-dimensional parallel training method.

Provided is a computer-readable recording medium on which a computer program for causing a computer device to execute the multi-dimensional parallel training method is recorded.

Provided is a computer device for performing a multi-dimensional parallel training method for Winograd-transformed convolution, the computer device comprising at least one processor implemented to execute computer-readable instructions, wherein the at least one processor: converts input data including an input feature map into a plurality of tiles; distributes the converted tiles to a plurality of workers arranged in two dimensions by a plurality of clusters and a plurality of groups, the converted tiles being distributed cluster by cluster; performs, through the workers of each of the clusters, data parallelism on the distributed input data; and performs, through the workers of each of the groups, intra-tile parallelism on the element units of the tiles applied to each of the groups among the distributed input data.

It is possible to provide a new multi-dimensional parallel training (MPT) of Winograd-domain convolution that exploits both the data parallelism and the intra-tile (or element-wise) parallelism available in the Winograd domain.

To efficiently support communication between workers under MPT, it is possible to provide a 2D organization of workers with a hybrid topology that uses a ring topology for collective communication in data parallelism and a high-connectivity topology for tile transfer.

It is possible to provide dynamic clustering (reconfiguration of the connection structure between workers) that allows the workers in the MPT architecture to optimize the balance between the two types of communication (weight gradients and tile transfer).

To reduce the bandwidth required for tile gathering, it is possible to provide prediction of the activation of spatial-domain neurons from quantized values of Winograd-domain data without loss of accuracy.

FIG. 1 shows graphs comparing (a) the number of operations (FLOPs) and (b) the number of memory accesses (L1 data cache accesses) for direct convolution and for Winograd-transformed convolution with tile sizes of 4 × 4 (Wino T4) and 6 × 6 (Wino T6).
FIG. 2 is a diagram showing the three stages of a convolutional layer with the Winograd transform.
FIG. 3 is a diagram showing a Winograd layer that updates the weights directly in the Winograd domain.
FIG. 4 is a diagram showing an example in which 4D tensors in the spatial domain are transformed into a dot product of tiles in the Winograd domain.
FIG. 5 is a diagram showing an example of the dot product expressed as element-wise matrix multiplications.
FIG. 6 is a diagram showing an example of direct convolution.
FIG. 7 is a diagram showing an example of transformed convolution (dot product).
FIG. 8 is a diagram showing an example of data-parallel training.
FIG. 9 is a diagram showing an example of multi-dimensional parallel training according to an embodiment of the present invention.
FIG. 10 shows graphs comparing the amount of communication transmitted per worker in each iteration for two different layers, in an embodiment of the present invention.
FIG. 11 is a graph showing the amount of communication transmitted per worker in each training iteration of FractalNet under different parallelization strategies, in an embodiment of the present invention.
FIG. 12 is a diagram showing examples of various parallelization configurations using dynamic clustering, in an embodiment of the present invention.
FIG. 13 is a diagram showing (a) a high-level block diagram of a scalable NDP architecture with p = 256, N_g = 16, and N_c = 16, and (b), (c), (d) three different configurations obtained through dynamic clustering, in an embodiment of the present invention.
FIG. 14 is a diagram showing an example of non-uniform quantization of Winograd-domain elements, in connection with the quantization of Winograd-domain tile values used to predict activation, in an embodiment of the present invention.
FIG. 15 is a diagram showing an example of quantization logic for single-precision floats, in connection with the quantization of Winograd-domain tile values used to predict activation, in an embodiment of the present invention.
FIG. 16 is a diagram showing an example of an activation prediction flow using quantization, in an embodiment of the present invention.
FIG. 17 shows graphs of examples of the actual and predicted ratios of non-activated tiles/lines for two different datasets, in an embodiment of the present invention.
FIG. 18 is a block diagram showing an example of an NDP, composed of control, computation, and communication units, that is added to the logic layer of each memory module, in an embodiment of the present invention.
FIG. 19 is a block diagram showing an example of P2P communication logic, in an embodiment of the present invention.
FIG. 20 is a block diagram showing an example of collective communication logic, in an embodiment of the present invention.
FIG. 21 is a block diagram showing an example of a computer device according to an embodiment of the present invention.
FIG. 22 is a flowchart showing an example of a data processing method according to an embodiment of the present invention.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

Accelerating neural network training is critical for exploring the design space of neural networks. Data parallelism, in which an input batch is distributed across multiple workers, is commonly used to accelerate the training of convolutional neural networks (CNNs). However, the communication time of the weight gradients can limit scalability for moderate batch sizes. Embodiments of the present invention propose multi-dimensional parallel training (MPT) of convolutional layers, which exploits both data parallelism and the intra-tile parallelism available in Winograd-transformed convolution. The workers are organized in two dimensions: one dimension uses intra-tile parallelism, while the other uses data parallelism. MPT reduces the amount of communication required for the weight gradients because they are communicated only along the data-parallel dimension. However, the Winograd transform fundamentally requires more data accesses, and the proposed MPT architecture also introduces a new type of communication, referred to as tile transfer: the gather/scatter of Winograd-domain feature maps (tiles). One embodiment proposes a scalable near-data processing (NDP) architecture that minimizes the cost of data access through 3D stacked memory while exploiting a memory-centric network organization to provide high connectivity between workers and thereby accelerate tile transfer. To minimize the communication overhead of tile gathering, one embodiment uses prediction of the activation of spatial-domain neurons to eliminate the communication of tiles that are transformed into non-activated neurons. In addition, one embodiment proposes dynamic clustering of the memory-centric network architecture, which reconfigures the interconnect topology between workers for each convolutional layer to balance the communication required for weight gradients and for tile transfer. According to an evaluation in one embodiment, the proposed MPT with the NDP architecture accelerates training by up to 2.7× and 21× compared to data-parallel training on the NDP architecture and on a multi-GPU (graphics processing unit) system, respectively.

1. Introduction

Fast training of neural networks accelerates the exploration of their design space and enables research on large-scale neural networks. One embodiment focuses on convolutional neural networks (CNNs), which are widely used for image recognition and classification. Data-parallel training, in which an input batch is distributed across multiple workers, is widely used for the convolutional layers of CNNs. Model parallelism can be used in some neural networks, but one embodiment focuses on data parallelism because it is the approach in general use. In data-parallel training, every worker produces a partial sum of the weight gradients (the same size as the weights) for its own subset of the input batch. The weight gradients are reduced (accumulated) across all workers, and the weights updated with stochastic gradient descent (SGD) are broadcast to all workers. With a fixed amount of data generated per worker, the optimal collective communication algorithm on a ring topology (pipelined reduce/broadcast) yields a communication time that is constant regardless of the number of workers. However, even with this optimal (pipelined) implementation, communication can limit scalability when the total batch size is fixed. As the number of workers increases, the per-worker batch size, and hence the amount of computation assigned to each worker, decreases, so communication accounts for a larger share of the total execution time. One common approach to scaling data parallelism is to increase the input batch size. However, previous studies have shown that a large batch size can degrade the quality of the trained model and slow convergence. One embodiment assumes synchronous SGD with the moderate batch sizes used in CNNs (i.e., 128-256), but proposes an alternative approach that enables scalable training.
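The scaling argument above can be made concrete with a small cost model. The sketch below is an illustration under stated assumptions, not the pipelined reduce/broadcast scheme of the embodiment: it uses the well-known ring all-reduce (reduce-scatter followed by all-gather), in which each worker sends 2(p − 1)/p times the gradient size. The function names and the 64 MiB gradient size are hypothetical.

```python
def ring_allreduce_bytes_per_worker(grad_bytes: int, workers: int) -> float:
    """Bytes each worker sends in a ring all-reduce (reduce-scatter + all-gather)."""
    if workers < 2:
        return 0.0
    # Each of the 2 * (p - 1) steps moves one chunk of size grad_bytes / p.
    return 2 * (workers - 1) * grad_bytes / workers


def samples_per_worker(total_batch: int, workers: int) -> float:
    """Per-worker share of a fixed total batch under data parallelism."""
    return total_batch / workers


if __name__ == "__main__":
    grad_bytes = 64 * 1024 * 1024  # hypothetical 64 MiB of weight gradients
    total_batch = 256              # fixed total batch size
    for p in (4, 16, 64, 256):
        sent = ring_allreduce_bytes_per_worker(grad_bytes, p)
        # Traffic per worker stays below 2x the gradient size, while the
        # per-worker computation (samples) keeps shrinking as p grows.
        print(p, samples_per_worker(total_batch, p), round(sent / grad_bytes, 3))
```

Under this model the per-worker traffic is nearly independent of the number of workers, whereas the per-worker share of the batch falls as 1/p, which is why communication comes to dominate at a fixed total batch size.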

In particular, one embodiment exploits a new type of parallelism, the intra-tile parallelism available in transformed convolutions such as the Winograd transform and the fast Fourier transform (FFT), to provide better scalability as the number of workers increases. Transformed convolution was proposed to reduce the amount of computation by turning the convolution operation in the spatial domain into a simpler dot product in the transform domain. With the Winograd transform, the feature maps and weights are converted into Winograd-domain tiles, each of size T × T. The element-wise dot product in the Winograd domain can be organized as T² independent matrix multiplications, in contrast to direct convolution, which consists of one large matrix multiplication.
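The independence of the T² matrix multiplications can be checked on toy data. The following is a minimal sketch in plain Python, assuming a hypothetical layout in which X[n][i] is the T × T Winograd-domain tile of input channel i for tile index n, and W[i][j] is the T × T weight tile for input channel i and output channel j. All function names are illustrative.

```python
import random


def random_tiles(outer, inner, T, rng):
    """Build outer x inner tiles of size T x T filled with small integers."""
    return [[[[rng.randint(-3, 3) for _ in range(T)] for _ in range(T)]
             for _ in range(inner)] for _ in range(outer)]


def elementwise_dot(X, W, T):
    """Reference form: Y[n][j] = sum over i of X[n][i] (.) W[i][j], tile by tile."""
    N, I, J = len(X), len(W), len(W[0])
    Y = [[[[0] * T for _ in range(T)] for _ in range(J)] for _ in range(N)]
    for n in range(N):
        for j in range(J):
            for i in range(I):
                for u in range(T):
                    for v in range(T):
                        Y[n][j][u][v] += X[n][i][u][v] * W[i][j][u][v]
    return Y


def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def as_independent_matmuls(X, W, T):
    """Same result as T * T independent (N x I) @ (I x J) products,
    one per tile position (u, v); no product depends on any other."""
    N, I, J = len(X), len(W), len(W[0])
    out = {}
    for u in range(T):
        for v in range(T):
            Xuv = [[X[n][i][u][v] for i in range(I)] for n in range(N)]
            Wuv = [[W[i][j][u][v] for j in range(J)] for i in range(I)]
            out[(u, v)] = matmul(Xuv, Wuv)  # N x J result for this position
    return out
```

Because each tile position (u, v) only reads the elements at that same position, the T² products can be assigned to different workers without exchanging weights between them.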

Exploiting this element-wise (or intra-tile) independence, one embodiment proposes MPT, which combines data parallelism and intra-tile parallelism for Winograd-transformed convolution. One embodiment also proposes a two-dimensional organization of workers suited to MPT, in which the workers in the same column form a group for data parallelism while the workers in the same row form a cluster for intra-tile parallelism. The input batch is distributed across the clusters (that is, within a group), while the elements of the Winograd-domain tiles are distributed across the groups. Because weights never need to be communicated between different groups (i.e., between different parts of a tile), the communication for weight reduce/broadcast is confined (isolated) within each group. Consequently, the size of the weight gradients generated per worker decreases by a factor equal to the number of groups, and scalability improves.
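The two-dimensional organization described above can be sketched as a simple index mapping. This is a minimal illustration, assuming a row-major worker numbering and a round-robin split of tile elements across groups; both choices, and all names in the code, are assumptions made for the example rather than details from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    wid: int
    cluster: int  # row index: workers in a row share one slice of the batch
    group: int    # column index: workers in a column hold the same tile elements


def organise(num_clusters: int, num_groups: int) -> list:
    """Lay out num_clusters * num_groups workers as a 2D grid (row-major)."""
    return [Worker(c * num_groups + g, c, g)
            for c in range(num_clusters) for g in range(num_groups)]


def batch_slice(cluster: int, num_clusters: int, batch: int) -> range:
    """Samples handled by one cluster (assumes batch divisible by num_clusters)."""
    per = batch // num_clusters
    return range(cluster * per, (cluster + 1) * per)


def tile_elements(group: int, num_groups: int, T: int) -> list:
    """Tile positions owned by one group (round-robin split, an assumption)."""
    return [(k // T, k % T) for k in range(T * T) if k % num_groups == group]


def weight_grad_elements(I: int, J: int, T: int, group: int, num_groups: int) -> int:
    """Winograd-domain weight-gradient elements reduced inside one group."""
    return I * J * len(tile_elements(group, num_groups, T))
```

For example, with 64 input and 64 output channels and T = 4, a single group would reduce 64 · 64 · 16 = 65,536 gradient elements per worker, whereas 16 groups reduce 4,096 each, which is the reduction by the number of groups stated above.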

While MPT reduces weight communication between workers, it raises additional challenges. MPT involves a new type of communication within a cluster to gather/scatter the Winograd-domain feature maps, which the embodiments of the present invention refer to as tile transfer. To support tile transfer efficiently, the embodiments propose a hybrid topology. The hybrid topology uses a low-diameter (high-radix) topology for tile communication across groups, whereas a ring topology is used for weight communication across clusters. However, the optimal configuration (i.e., the number of workers in each dimension) may vary with the communication requirements of each convolutional layer. To make the architecture flexible, the embodiments propose dynamic clustering of workers (a reconfigurable topology), which reconfigures the interconnection network based on the pre-computed communication volume of the two dimensions. To reduce the communication volume of tile gathering, the embodiments propose a method of predicting the activation of neurons from the quantized values of Winograd-domain data elements without loss of accuracy.

While the amount of computation decreases in the Winograd domain, the tiles and weights in the Winograd domain are larger than the feature maps and weights in the spatial domain, which results in more data accesses. FIG. 1 compares the operations and memory accesses of direct convolution and of two Winograd-transformed convolutions for five different layers. According to these results, Winograd-transformed convolution can reduce the number of operations by up to 2.8×, while increasing the amount of data access by up to 4.4× on average. Accordingly, the embodiments of the present invention propose a scalable NDP architecture that exploits the high bandwidth of 3D stacked memory for the increased data accesses and forms a memory-centric architecture. In particular, the embodiments of the present invention include the following.

● A new MPT of Winograd-domain convolution is proposed that exploits both the data parallelism and the intra-tile (or element-wise) parallelism available in the Winograd domain.

● To efficiently support communication between workers under MPT, a 2D organization of workers is proposed with a hybrid topology that uses a ring topology for collective communication in data parallelism and a high-connectivity topology for tile transfer.

● Dynamic clustering is proposed, which allows the workers in the MPT architecture to optimize the balance between the two types of communication (weight gradients and tile transfer).

● To reduce the bandwidth required for tile gathering, prediction of the activation of spatial-domain neurons from quantized values of Winograd-domain data, without loss of accuracy, is proposed.

2. Background

The following symbols are used to describe the convolution algorithms.

x, y, w: spatial-domain input feature map, output feature map, and weights

X, Y, W: Winograd-domain input feature map, output feature map, and weights

I, J, B: number of input channels, number of output channels, and batch size

A. Convolutional layer

Training a convolutional layer consists of three phases: forward propagation, backward propagation, and weight update. During forward propagation (hereinafter 'fprop'), the input feature maps x_i are convolved with the weights w_{i,j} and passed through a nonlinear activation function f, such as a rectified linear unit (ReLU), to compute the output feature map y_j.

During backward propagation (hereinafter 'bprop'), the gradient of the loss function L with respect to the input (∂L/∂x_i) is computed by convolving the output gradient (∂L/∂y_j) with the flipped weights (w^f), and the result is then passed through the derivative of the activation function (f′).

During the weight update phase (hereinafter 'updateGrad'), the weight gradient (∂L/∂w_{i,j}) is computed by the convolution between the output gradient and the input feature map.

학습 속도(learning rate)가 곱해진 가중치 그라디언트가 추가함에 의해(또는 평균 값을 계산함에 의해) 가중치가 업데이트된다.The weight is updated by adding a weight gradient multiplied by the learning rate (or by calculating the average value).

B. Winograd Layer

The Winograd algorithm for 2D convolution was proposed to reduce the number of multiplications by replacing the convolution operation with an element-wise product (⊙). The 2D Winograd transform is denoted F(m × m, r × r) and can be expressed as Equation 4 below.

y = A^T [(G w G^T) ⊙ (B^T x B)] A    (Equation 4)

Here, w is a weight of size r × r, y is an output of size m × m, and G, B, and A are coefficient matrices. In theory, compared with direct convolution, F(4 × 4, 3 × 3) reduces the number of multiplications by a factor of four by turning the convolution into an element-wise product (36 multiplications per 6 × 6 tile instead of 4 · 4 · 9 = 144).
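The transform can be checked numerically on a single tile. The sketch below is illustrative only: it uses the coefficient matrices commonly published for F(2 × 2, 3 × 3) in the Winograd convolution literature (an assumption, since this document does not list the matrices), and the helper names `matmul`, `transpose`, `winograd_f2x2_3x3`, and `direct_conv_2x2` are hypothetical.

```python
# Coefficient matrices commonly used for F(2x2, 3x3).
# They are an assumption here; this document does not list them.
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]
G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def winograd_f2x2_3x3(x, w):
    """y = A^T [(G w G^T) (.) (B^T x B)] A for one 4x4 input tile."""
    X = matmul(matmul(BT, x), transpose(BT))   # input tile in the Winograd domain
    W = matmul(matmul(G, w), transpose(G))     # weight tile in the Winograd domain
    Y = [[X[u][v] * W[u][v] for v in range(4)] for u in range(4)]  # 16 multiplications
    return matmul(matmul(AT, Y), transpose(AT))  # 2x2 spatial-domain output


def direct_conv_2x2(x, w):
    """Direct 3x3 convolution (CNN-style correlation): 36 multiplications."""
    return [[sum(x[i + a][j + b] * w[a][b] for a in range(3) for b in range(3))
             for j in range(2)] for i in range(2)]


x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
w = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
y_winograd = winograd_f2x2_3x3(x, w)
y_direct = direct_conv_2x2(x, w)
```

Both functions return the same 2 × 2 output for this tile; the Winograd path needs 16 element-wise multiplications where the direct path needs 36.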

To apply the Winograd algorithm to a convolution layer, prior work proposed a tile-based approach. The input feature map is converted into multiple tiles, each of size T × T (where T = m + r - 1), and the weights are likewise converted into tiles of the same T × T size. The element-wise product of the input tiles and the weight tiles produces output tiles, which are inverse-transformed into the output feature map. The overall computation for fprop can then be expressed as the matrix equation in Equation 5 below.

Here, (u, v) denotes the element in the u-th row and v-th column of a tile.

FIG. 2 shows fprop and updateGrad of a convolution layer using the Winograd transform. bprop has computation and data sets similar to those of fprop, except that the input and output are replaced by ∂L/∂y and ∂L/∂x, respectively. The Winograd layer was proposed recently; in that proposal, the weights can be updated directly in the Winograd domain, as shown in FIG. 3. That work shows that the Winograd layer does not degrade the quality of the trained network for the small weight/tile sizes used in the embodiments of the present invention, and that it can improve accuracy because the Winograd-domain weights are larger than the spatial-domain weights. Nvidia's current cuDNN supports Winograd-transformed convolution, but only for weight sizes of 3 × 3 and 5 × 5, for which the transform does not affect accuracy. As the weight/tile size increases, however, numerical instability grows and can affect accuracy. Recent work achieves improved stability and accuracy through improved Winograd transform matrices. If accuracy is improved in this way, the multi-dimensional parallel structure proposed in the embodiments of the present invention can be extended to larger weight sizes.

C. Related Work

Accelerators: Various CNN accelerator architectures have been proposed. These prior designs maximize data reuse and power efficiency for inference on a single chip. While most accelerators target inference, heterogeneous server architectures have recently been proposed for neural network training with scalability and efficient computation. Scalable parallel training using multiple GPUs has also been widely studied, including acceleration of collective communication over ring or tree topologies for data parallelism, so as to use the available communication bandwidth efficiently. In contrast, the embodiments of the present invention propose reducing the amount of communication as the number of workers increases, by exploiting intra-tile parallelism in transformed convolution and by creating independent groups for collective operations.

Scalable parallel training: Another approach to parallel training is to increase the batch size, but neural network models trained with large batches tend to generalize less well (the "generalization gap"). Recent deep learning research has actively explored increasing the batch size while preserving model quality by scaling the learning rate linearly with the batch size. However, a large batch size can narrow the range of learning rates that give stable convergence with acceptable test accuracy. This is because a smaller batch size provides more weight updates. Training with a large batch size can also take longer to reach the same quality of trained network. Because the batch size can be scaled only to a limited extent, the communication-to-computation ratio increases as the number of workers grows (or as each worker's compute capability improves with architectural advances). To provide high scalability with a reasonable batch size, the embodiments of the present invention reduce communication time through the proposed multi-dimensional parallelism and hybrid interconnect structure.

3D integration: Recent advances in 3D integration technology have enabled 3D stacked dies with through-silicon vias (TSVs). The Hybrid Memory Cube (HMC) is one example of 3D stacked memory; it has a logic die at the bottom and includes processing elements for near-data processing (NDP). HMC modules, each of which acts as a router, can be interconnected by high-speed links to scale the system and form a memory-centric network. Near-data processing for CNNs has recently been proposed as a way to exploit the high bandwidth and energy efficiency of 3D stacked memory. However, that prior work focuses on inference on a single memory module, whereas the embodiments of the present invention focus on training across multiple memory modules. Some recent studies evaluated multiple memory modules, but the number of modules was limited to a few (at most four), and the communication overhead between memory modules was not properly evaluated. In contrast, the embodiments of the present invention focus on scalable training with a large number of workers (memory modules) and on the impact of communication on scalability. The embodiments of the present invention assume high-bandwidth serial links between workers. NVLink is one example of such an interconnect, and recently announced industry interconnect standards, including CCIX, OpenCAPI, and Gen-Z, are further examples that aim to connect accelerators and/or memory modules with high bandwidth.

3. Scalable Parallel Training

The following describes multi-dimensional parallel training (MPT), which combines data parallelism and intra-tile parallelism to improve scalability, and then discusses the scalability challenges of MPT.

A. Parallelism in Transformed Convolution

The element-wise product of transformed convolution can be depicted as in FIG. 4. Each input feature map is transformed into t tiles, each of size T × T, giving a total of B × t tiles per input channel. By decomposing each T × T tile into T² elements, the product of the tiles can be expressed as T² independent element-wise matrix multiplications, as shown in Equation 5 and FIG. 5.

In the Winograd-domain product, no computation takes place between elements at different positions in a tile (that is, at different (u, v)). This independence, or parallelism, available in transformed convolution can be exploited for parallelization with minimal communication. This embodiment calls it intra-tile parallelism. FIGS. 6 and 7 offer another view of intra-tile parallelism in comparison with direct convolution. With direct convolution, most input data elements (or neurons) are shared in computing different output elements, so element-wise partitioned execution requires heavy duplication of input data. In contrast, the transform turns the convolution into an element-wise product, so the input data can be separated element by element, which enables efficient element-wise (or intra-tile) parallel execution. FIGS. 6 and 7 show an example for the updateGrad phase, but the same difference between direct and transformed convolution also applies to the fprop and bprop phases.
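The independence between tile positions can be illustrated with a short Python sketch. It assumes that, for each position (u, v), the output is the product of a (tiles × I) matrix of input elements and an (I × J) matrix of weight elements, which matches the description of T² independent matrix multiplications; the function names and the row-major split of positions across groups are hypothetical.

```python
import random

T = 4  # tile edge, for example F(2x2, 3x3)


def tile_products(X, W, positions):
    """For each tile position (u, v), multiply a (tiles x I) by an (I x J) matrix.

    X[t][i] and W[i][j] are T x T tiles; only the listed positions are touched.
    """
    n_tiles, n_in, n_out = len(X), len(W), len(W[0])
    return {
        (u, v): [[sum(X[t][i][u][v] * W[i][j][u][v] for i in range(n_in))
                  for j in range(n_out)] for t in range(n_tiles)]
        for (u, v) in positions
    }


def split_positions(n_groups):
    """Partition the T*T tile positions evenly across groups (row-major order)."""
    positions = [(u, v) for u in range(T) for v in range(T)]
    per_group = len(positions) // n_groups
    return [positions[g * per_group:(g + 1) * per_group] for g in range(n_groups)]


def rand_tile():
    return [[random.randint(-3, 3) for _ in range(T)] for _ in range(T)]


random.seed(0)
n_tiles, n_in, n_out = 3, 2, 2
X = [[rand_tile() for _ in range(n_in)] for _ in range(n_tiles)]
W = [[rand_tile() for _ in range(n_out)] for _ in range(n_in)]

# One worker computing every position ...
full = tile_products(X, W, split_positions(1)[0])

# ... versus four groups that each compute only their own positions.
merged = {}
for group_positions in split_positions(4):
    merged.update(tile_products(X, W, group_positions))
```

Each group reads only the tile elements at its own positions, so no intermediate values are exchanged during the product itself, and `merged` equals `full`.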

B. Multi-dimensional Parallel Training (MPT)

For parallel training of CNNs, data-parallel training is widely used for convolution layers. Data-parallel training requires no inter-worker communication in the fprop and bprop phases, but in the updateGrad phase the weight gradients (∂L/∂w or ∂L/∂W) generated by all workers must be reduced and the updated weights must be broadcast. Consequently, the amount of communication each worker performs in the updateGrad phase is nearly fixed and proportional to the weight size |w|, regardless of the number of workers (p).

This embodiment proposes MPT, which combines data parallelism and intra-tile parallelism. With MPT, the workers are arranged in two dimensions as shown in FIGS. 8 and 9: workers in the same row form a cluster, and workers in the same column form a group. Data parallelism is applied across clusters, so that the input batch is distributed over the N_c clusters, and intra-tile parallelism is applied across groups, so that the elements of the Winograd-domain tiles are distributed over the N_g groups. Therefore N_c × N_g = p, and when N_g = 1, only data parallelism is used across all workers. As shown in FIGS. 5 and 7, no computation takes place between different tile elements, and hence between different groups, during the element-wise product. Each part of the Winograd-domain weights (W_{u,v}) is therefore used only within its associated group. Tile transfer occurs across groups (that is, within each cluster) only for the transform between spatial-domain neurons and Winograd-domain tiles in the fprop and bprop phases, as described later. In the updateGrad phase, each worker generates weight gradients for the part of the weights assigned to its group, and the communication of weight gradients takes place within each group. As a result, the size of the weight gradients that must be reduced and broadcast per worker decreases by a factor of N_g.
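The two-dimensional arrangement can be sketched in Python as follows. This is a minimal illustration, assuming that the batch and the T² tile elements divide evenly and that tile elements are handed out in row-major order; the function name `mpt_assignment` and the `(cluster, group)` worker key are hypothetical.

```python
def mpt_assignment(p, n_g, batch, tile_edge=4):
    """Map each worker (cluster c, group g) to its batch samples and tile elements.

    Assumes p is divisible by n_g, batch by n_c, and tile_edge**2 by n_g.
    """
    if p % n_g:
        raise ValueError("p must be divisible by n_g")
    n_c = p // n_g                      # N_c x N_g = p
    elements = [(u, v) for u in range(tile_edge) for v in range(tile_edge)]
    per_group = len(elements) // n_g    # tile elements handled by one group
    per_cluster = batch // n_c          # batch samples handled by one cluster
    workers = {}
    for c in range(n_c):
        for g in range(n_g):
            workers[(c, g)] = {
                # Data parallelism: the batch is split across clusters.
                "samples": list(range(c * per_cluster, (c + 1) * per_cluster)),
                # Intra-tile parallelism: tile elements are split across groups.
                "elements": elements[g * per_group:(g + 1) * per_group],
            }
    return workers


layout_16 = mpt_assignment(p=256, n_g=16, batch=256)
layout_4 = mpt_assignment(p=256, n_g=4, batch=256)
layout_1 = mpt_assignment(p=256, n_g=1, batch=256)
```

With 256 workers and a batch of 256, N_g = 16 gives each worker 16 samples and one tile element, N_g = 4 gives 4 samples and one row of the tile, and N_g = 1 gives one sample and the whole tile, which is plain data parallelism.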

C. Challenges of MPT

MPT reduces the communication for weight updates but introduces a new type of communication. In the fprop and bprop phases, input tiles are split element-wise and distributed to the workers in a cluster for intra-tile parallelism. This is called tile scattering. Each worker loads its weight elements and computes the output elements for the part of the tile assigned to its group. The different parts of each output tile must then be combined into a complete tile for the inverse transform into the spatial-domain feature map (or neurons), which requires communicating tiles within the cluster. This is called tile gathering. The spatial-domain neurons pass through the nonlinear activation and pooling (if present) layers and become the input of the next layer. FIG. 10 compares the per-worker traffic per iteration for two different layers from Table 1 below.

Abbr.  | CNN layer         | I, J     | x (y) dim. | w dim.
Early  | ResNet-34 conv2_x | 128, 128 | 56 × 56    | 3 × 3
Mid-1  | ResNet-34 conv4_x | 256, 256 | 14 × 14    | 3 × 3
Late-1 | ResNet-34 conv5_x | 512, 512 | 7 × 7      | 3 × 3
Mid-2  | WRN-40-10 conv3   | 320, 320 | 16 × 16    | 3 × 3
Late-2 | WRN-40-10 conv4   | 640, 640 | 8 × 8      | 3 × 3

Table 1 lists the five example convolution layers used in the evaluation according to an embodiment of the present invention. FIG. 10 assumes a batch size of 256 and [expression missing in source]. As shown in FIG. 10(b), for late layers with small feature maps and larger weights, MPT greatly reduces weight communication, and tile transfer makes up only a small part of the total traffic when the number of workers (p) is large. For layers with relatively large feature maps (for example, Mid-1), MPT still reduces weight communication significantly, but tile transfer incurs a large amount of communication, which diminishes the benefit of MPT (FIG. 10(a)).

To understand the scalability of MPT with the added communication, the traffic of data parallelism and of MPT can first be analyzed. For data-parallel training over a ring topology, each worker sends weight gradients of size |W| to the next worker, and the hop count is p - 1. The per-worker traffic is therefore [expression missing in source], which is nearly constant when p is large. With MPT, the weights are divided across N_g groups, and each worker sends only |W|/N_g of weight gradients, using N_c - 1 hop traversals per group. The per-worker traffic for weight gradients is therefore [expression missing in source]. For tile transfer, the input batch (of size B) is split across N_c clusters, and the tiles are split across the N_g groups inside each cluster. Each worker therefore holds [expression missing in source] of the tile data and sends a [expression missing in source] portion of it to the other workers in the same cluster. The per-worker traffic for tile transfer is thus [expression missing in source].

FIG. 11 shows the amount of data (traffic) each worker transfers per training iteration of FractalNet, summed over all layers. A fixed batch size of 256 and [expression missing in source] are assumed. When the number of workers is small, the communication overhead of MPT is actually higher than that of data-parallel training because of the tile transfer overhead. For data-parallel training, the per-worker traffic stays constant as p grows, while the computation assigned to each worker shrinks as the number of workers increases with a fixed total batch size, so scalability suffers. For MPT, the per-worker traffic is proportional to [expression missing in source] for weight gradients and to [expression missing in source] for tile transfer, so the per-worker traffic decreases as the number of workers (p) increases.

The traffic required for weight and tile transfer differs significantly with the structure of the convolution layer, and the sizes of the two parallelism dimensions (that is, N_g and N_c) determine the weight traffic and the tile traffic. The two are also in a trade-off relationship: more groups reduce weight communication but increase tile transfer, whereas fewer groups increase weight communication but reduce tile transfer. With a single group, for example, only data parallelism remains and tile transfer is eliminated entirely. The optimal MPT configuration is therefore the one that minimizes the total traffic, which varies for each convolution layer. This embodiment proposes dynamic clustering, which can reconfigure MPT per layer to minimize the overall communication. FIG. 11 also shows the result of MPT with dynamic clustering, which reduces the required traffic. For a small number of workers, the benefit of dynamic clustering is significant because a fixed interconnect results in inefficient bandwidth use across different layers. Even for p = 256, dynamic clustering still reduces communication by 1.4×. The details of dynamic clustering with a hybrid topology are presented next, and Section 5 presents an activation prediction method that reduces the traffic of tile gathering by predicting neuron activation and bypassing the gathering of tiles that transform into inactive neurons. FIG. 12 shows the application of dynamic clustering and activation prediction to MPT.

4. Dynamic Clustering

This section describes the proposed dynamic clustering, which can reconfigure the number of groups N_g and clusters N_c to match the characteristics of each neural network layer. The high-level architecture is shown in FIG. 13(a) and consists of workers, or "memory modules", interconnected through a memory-centric network. FIG. 13 shows an architecture with 256 workers organized along two dimensions (groups and clusters) to match the proposed MPT. Within each group, the workers are connected by a ring interconnect through the memory-centric network to support collective operations on the weights. Compared with data-parallel training, MPT creates multiple, shorter rings: for p workers, the ring length decreases from p (data parallel) to N_c (MPT), and there are N_g rings. Each of the N_g rings performs collective operations only on data of size |W|/N_g. For the 16 workers inside each cluster, a 2D flattened butterfly (FBFLY), a densely connected network that efficiently supports all-to-all traffic, is used for tile gathering and scattering.

As described above, the number of groups N_g and the number of clusters N_c create a trade-off between the traffic for weight gradients and the traffic for tile gathering/scattering. A smaller N_g (and larger N_c) is preferable for early layers with larger feature maps, to minimize tile traffic, whereas a larger N_g (and smaller N_c) is preferable for late layers with larger weights, to minimize weight-gradient traffic. To allow a flexible organization of workers for each layer, this embodiment proposes dynamic clustering, which changes the number of groups and clusters and reconfigures the interconnect to maximize scalability while minimizing communication time.

To change the topology dynamically (that is, to modify N_g and N_c), this embodiment uses connections through the host that link several groups to each other. When N_c increases (or N_g decreases), the connections through the host are used to enlarge the groups. Because a neural network has a fixed layer structure and the required traffic and link bandwidth can be computed in advance, the optimal configuration for each layer that minimizes communication time is determined beforehand and does not change. Reconfiguration between two layers only modifies the routes (or destinations) of tile transfer and weight communication and incurs no additional data transfer or overhead.

The evaluation supports the following three worker configurations:

● N_g = 16, N_c = 16: no routing through the host.

● N_g = 4, N_c = 64: gr0⇔gr3, gr4⇔gr7, gr8⇔gr11, gr12⇔gr15

● N_g = 1, N_c = 256: gr0⇔gr15, gr3⇔gr4, gr7⇔gr8, gr11⇔gr12

Here, '⇔' denotes an additional connection formed between two groups through the host ('Host CPU' in FIG. 13(a)). The first configuration (N_g = 16, FIG. 13(b)) can be used to minimize weight communication. For the F(2 × 2, 3 × 3) transform with a 4 × 4 tile, a single tile element is assigned to each group. A 2D FBFLY topology interconnects the 16 workers inside each cluster, so tile data can be delivered within at most two hops. The second configuration (N_g = 4, FIG. 13(c)) has more weight communication than the N_g = 16 configuration but less tile traffic. For the F(2 × 2, 3 × 3) transform with a 4 × 4 tile, four elements (a single line of the 2D tile) are assigned to each group, and tile traffic can be reduced further by performing a 1D Winograd transform before sending the tile data, as detailed in Section 5. Four fully connected workers (for example, workers 3, 7, 11, and 15 in FIG. 13(a)) form a cluster, and tile data can be delivered in a single hop. The third configuration (N_g = 1, FIG. 13(d)) falls back to data parallelism, with no communication for tile transfer but the largest weight traffic. Depending on the characteristics of each layer, the configuration that minimizes communication time and maximizes performance is selected.
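The per-layer selection among the three configurations can be sketched as below. The traffic expressions in this text are not reproduced, so the cost model here is purely hypothetical: it assumes a ring reduce plus broadcast inside each group for the weight gradients, and it assumes that each worker sends the (N_g - 1)/N_g share of its tile data that other workers in the cluster need. The function names and the layer tuples (taken from the I, J, and feature-map columns of Table 1) are illustrative only.

```python
def tiles_per_map(height, width, m=2):
    """Number of tiles per feature map for an m x m output tile (rounded up)."""
    return ((height + m - 1) // m) * ((width + m - 1) // m)


def traffic_per_worker(layer, n_g, p=256, batch=256, tile_edge=4, m=2):
    """HYPOTHETICAL cost model: elements sent per worker per iteration.

    This is not the document's formula; it only mimics the described trade-off.
    """
    n_c = p // n_g
    n_in, n_out, height, width = layer
    weight_elems = n_in * n_out * tile_edge ** 2          # Winograd-domain weights
    # Assumed: ring reduce + broadcast over the n_c workers of one group.
    weight = 2 * (n_c - 1) / n_c * weight_elems / n_g
    # Input tiles scattered plus output tiles gathered inside one cluster.
    tile_elems = (batch / n_c) * tiles_per_map(height, width, m) \
        * (n_in + n_out) * tile_edge ** 2
    # Assumed: a worker holds 1/n_g of that data and sends (n_g - 1)/n_g of it.
    tile = tile_elems / n_g * (n_g - 1) / n_g
    return weight + tile


def best_config(layer, candidates=(16, 4, 1)):
    """Pick the N_g with the lowest modeled traffic for this layer."""
    return min(candidates, key=lambda n_g: traffic_per_worker(layer, n_g))


# (I, J, feature-map height, feature-map width) from Table 1.
LAYERS = {"Early": (128, 128, 56, 56), "Late-1": (512, 512, 7, 7)}
choice = {name: best_config(layer) for name, layer in LAYERS.items()}
```

Under this assumed model, the Early layer selects N_g = 1 and the Late-1 layer selects N_g = 16, in line with the preference for fewer groups on early layers and more groups on late layers. Because the layer structure is fixed, such a table could be computed once before training.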

5. Activation Prediction

With the proposed MPT, tile transfer (gathering and scattering) of the Winograd-domain feature maps (tiles) is essential. The following describes how to minimize the communication required for tile transfer. Zero-value skipping is used to reduce the traffic of tile scattering, and an activation prediction method is proposed for tile gathering.

A. Activation Prediction Without Loss of Accuracy

To reduce the traffic of tile gathering, this embodiment proposes a method that predicts activation without loss of accuracy. The rectified linear unit (ReLU), which passes only neurons with positive values to the next layer, is assumed as the nonlinear activation function. If none of the neurons transformed from a tile are activated, the inverse transform of that tile can be skipped. To skip unnecessary tile gathering, this embodiment proposes that the source worker send quantized values of the tile elements for prediction before sending the actual values to the destination worker. Because a transform is involved, the main challenge is to predict the activation of spatial-domain neurons from quantized values of the Winograd-domain data (tiles). In the proposed method, the destination worker performs a conservative prediction by computing both the estimated neuron value and the maximum possible quantization error. Because multiple quantized values are multiplied by coefficients and summed, quantization errors accumulate during the transform.

In experiments with the ImageNet data set and trained CNNs, the values of the Winograd-domain tiles were observed to follow a normal distribution. To track the normal distribution of the real values appropriately, this embodiment uses non-uniform quantization, as shown in FIG. 14. The range of real values is divided into several regions, and each region has the same number of steps. Real values outside the range are marked as overflow. Each time the region changes, the step size Δ doubles, and the step size is determined by the standard deviation σ of the real values. FIG. 15 shows a simple hardware implementation of non-uniform quantization for single-precision floating-point numbers, which uses simple integer arithmetic logic, a bit shifter, and a precomputed, stored logarithm of 1/Δ.
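A software sketch of such a region-based quantizer is given below. It is illustrative only: the number of regions, the number of steps per region, truncation toward zero, and the function name `quantize` are assumptions, and `delta` is passed in directly because the mapping from σ to Δ is not specified here.

```python
def quantize(value, delta, steps_per_region=4, n_regions=4):
    """Return (quantized value, step size), or None when |value| is out of range.

    Region k uses step size delta * 2**k and holds steps_per_region steps.
    Magnitudes are truncated toward zero, so |value - quantized| < step size.
    """
    magnitude = abs(value)
    region_start = 0.0
    for k in range(n_regions):
        step = delta * 2 ** k                        # step size doubles per region
        region_end = region_start + steps_per_region * step
        if magnitude < region_end:
            level = int((magnitude - region_start) / step)
            quantized = region_start + level * step
            return (quantized if value >= 0 else -quantized, step)
        region_start = region_end
    return None  # overflow: the value lies outside all regions


small = quantize(1.3, delta=0.5)
medium = quantize(-4.7, delta=0.5)
large = quantize(29.9, delta=0.5)
overflow = quantize(31.0, delta=0.5)
```

With `delta = 0.5`, the regions cover magnitudes [0, 2), [2, 6), [6, 14), and [14, 30) with step sizes 0.5, 1, 2, and 4. The returned step size is the resolution that a destination worker would need for a conservative error bound.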

FIG. 16 shows the overall flow of activation prediction, which can be classified as 1D prediction or 2D prediction according to the amount of data assigned to each group. When the number of groups is small, each worker can hold entire lines (rows or columns) of a tile, so the first 1D transform can be computed at the source worker using real values. After the first 1D transform at the source worker, the quantized values of the 1D transform output are sent to the destination worker for prediction. This is called 1D prediction. In contrast, with a large number of groups, each worker does not hold an entire line of a tile, and the source worker sends the quantized values of its output elements (a part of the tile) to the destination worker for activation prediction. This is called 2D prediction.

At the destination worker, the 2D transform is applied to both the quantized values and the quantization resolution to compute the estimated value of each neuron and the maximum possible positive quantization error. The resolution denotes the quantization step size (1/2/4/8Δ in FIG. 14) or the maximum difference between a real value and its quantized value. During the first 1D transform, the positive (+) and negative (-) maximum possible errors are computed. The Winograd transform is essentially a matrix multiplication with a coefficient matrix, as in Equation 4. The positive (negative) maximum possible error of the first 1D transform is computed by adding only the positive (negative) terms during the matrix multiplication. During the second 1D transform, the positive (negative) maximum errors of the first 1D transform are multiplied by the positive (negative) coefficients to obtain the final positive maximum possible error. To perform conservative prediction without loss of accuracy, a neuron is predicted to be inactive only when "estimated value + maximum error" is less than zero. The proposed method therefore never produces a false negative, that is, it never predicts an activated neuron to be inactive.
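The conservative rule above can be illustrated with a small Python sketch for the F(2 × 2, 3 × 3) case. Several things here are assumptions for illustration: the matrix `AT` is the commonly used F(2 × 2, 3 × 3) output transform and is assumed to correspond to the coefficient matrix the text refers to; the quantization error is taken to be symmetric (each element may be off by at most ± its bound), so the positive error bound is obtained by pushing the bounds through the transform with absolute-valued coefficients, which is a simplification of tracking positive and negative errors separately; all function names are hypothetical.

```python
# Commonly used output transform A^T for F(2x2, 3x3) (assumed here).
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def transform_2d(tile, at):
    """Apply the two 1D transforms: at * tile * at^T."""
    return matmul(matmul(at, tile), transpose(at))


def predict_inactive(q_tile, err_tile):
    """Per-neuron flags: True only if the neuron is certainly non-positive.

    q_tile:   4x4 dequantized Winograd-domain values.
    err_tile: 4x4 non-negative bounds on |real - dequantized|.
    """
    estimate = transform_2d(q_tile, AT)
    abs_at = [[abs(v) for v in row] for row in AT]
    max_pos_error = transform_2d(err_tile, abs_at)
    return [[estimate[i][j] + max_pos_error[i][j] < 0
             for j in range(len(estimate[0]))]
            for i in range(len(estimate))]


def tile_can_be_skipped(q_tile, err_tile):
    """A tile transfer can be skipped only if every neuron is predicted inactive."""
    return all(all(row) for row in predict_inactive(q_tile, err_tile))
```

Because the bound is added before the sign test, a neuron whose real value is positive can never be flagged inactive under the stated error assumption; the cost is that some truly inactive neurons are still transferred.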

B. Prediction accuracy and tile transmission

FIG. 17 shows the actual and predicted ratios of inactive tiles and lines when the F(2 × 2, 3 × 3) transform is used. In this embodiment, the ratios are measured using pre-trained weights and two data sets, CIFAR and ImageNet. An inactive tile (line) is one for which all neurons transformed from the tile (line) are predicted to be inactive in the 2D prediction case (1D prediction case). The dotted line in FIG. 17 shows the ratio measured with real values and represents the upper bound of prediction. Non-uniform quantization with four regions matches the distribution of Winograd-domain values well and gives the best prediction results for all test cases. With 64 quantization levels (6 bits) for 2D prediction and 32 quantization levels (5 bits) for 1D prediction, activation prediction can reduce tile collection traffic by 34.0% and 78.1%, respectively. Overall, 1D prediction achieves higher prediction accuracy than 2D prediction, because the first 1D transform is performed at the source worker with real values and is bypassed at the destination worker, which reduces the accumulation of quantization error.

When input tiles are scattered, zero values in the input tiles can be omitted. Zero skipping reduces input-tile scattering traffic by 39.3% and 64.7% for 2D prediction and 1D prediction, respectively. Skipped data are filled in with zeros when the worker computes the dot products. The information about which data were skipped by zero skipping, and which tile data were predicted to be inactive by activation prediction, is shared between the destination and source workers through activation maps of the input and output tiles.
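Zero skipping with a shared activation map can be sketched as follows. This is only an illustration of the idea: the map is modelled as a list of booleans (one per element) and the function names are hypothetical; the actual encoding of the activation map is not specified here.

```python
def pack_nonzero(values):
    """Sender side: build the activation map and keep only non-zero values."""
    activation_map = [v != 0 for v in values]
    payload = [v for v in values if v != 0]
    return activation_map, payload


def unpack(activation_map, payload):
    """Receiver side: restore the original layout, filling skipped slots with 0."""
    remaining = iter(payload)
    return [next(remaining) if present else 0.0 for present in activation_map]
```

For example, `pack_nonzero([0.0, 1.5, 0.0, -2.0])` sends two values instead of four, and `unpack` on the receiving worker reproduces the original four-element layout before the dot products are computed.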

6. Scalable Near-Data Processing (NDP) architecture

In this embodiment, NDP may be used as the workers. As shown in FIG. 1, Winograd-transformed convolution reduces computation but can increase the amount of data accessed. To provide high memory bandwidth, this embodiment may use 3D-stacked memory with through-silicon vias (TSVs) and a logic layer that enables NDP. This section discusses a scalable NDP architecture that implements the proposed MPT, dynamic clustering, and activation prediction. FIG. 18 is a block diagram illustrating an example of the NDP added to the logic layer of each memory module, consisting of control, computation, and communication units, according to an embodiment of the present invention.

A. Control Units

When CNN training starts, the host builds a task graph from the given CNN structure and parameters (such as the batch size and the transform algorithm). Each node represents a single task, and an edge between nodes represents a data dependency. In this embodiment, a task is created as a block of computation that fits the computation unit (the systolic array) and is computed at once. A single convolutional layer may therefore consist of several task nodes. To reduce synchronization overhead, this embodiment may use dependency checking based on update counters. Feature maps may depend on the previous layer, and weights may depend on the previous iteration. After each task completes, its update counter is incremented, and the next task may start once all update counters linked to its preceding tasks have been updated. The task graph may be stored in each NDP; the task scheduler in the NDP control unit loads tasks in a predefined order and, when the dependency check passes, starts the computation and programs the DMA.
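The update-counter dependency check can be sketched with a toy scheduler. This is a simplified software model under stated assumptions: a single training iteration, one counter per task, and a task considered ready once each predecessor's counter has reached 1. The function name, the dictionary-based task graph, and the task names in the usage note are hypothetical.

```python
def run_tasks(order, deps):
    """Execute tasks in load order, gated by update counters.

    order: task names in the predefined load order.
    deps:  dict mapping a task to the tasks whose output it needs.
    Returns the order in which tasks actually started.
    """
    counters = {task: 0 for task in order}
    pending = list(order)
    started = []
    while pending:
        progressed = False
        for task in list(pending):
            # Dependency check: every predecessor counter must have been updated.
            if all(counters.get(d, 0) >= 1 for d in deps.get(task, [])):
                started.append(task)   # stand-in for computation and DMA setup
                counters[task] += 1    # completion increments the update counter
                pending.remove(task)
                progressed = True
        if not progressed:
            raise ValueError("unsatisfiable dependency in task graph")
    return started
```

For instance, `run_tasks(["conv2", "relu1", "conv1"], {"conv2": ["relu1"], "relu1": ["conv1"]})` starts `conv1` first even though it was loaded last, because the other two tasks fail the counter check until their predecessors complete.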

B. Computation Units

Because most neural network layers can be mapped to matrix multiplication, systolic arrays are commonly used in CNN hardware accelerators, and this embodiment also assumes a systolic array in the proposed architecture. Assuming that the logic layer is fabricated in a 28 nm process, this embodiment assumes 64 × 64 FP32 MAC units, which occupy 31% of the logic layer area. The number of MAC units can be determined by considering not only the area overhead but also the balance between the available DRAM bandwidth and computation. The systolic array receives input data streams from two sides of the array boundary, and one of the two input streams must change to compute a new output. Thus, in the worst case, half of the input data is unchanged and reused from the on-chip buffer while the other half is fetched from DRAM. Because one side of the input data stream has 64 lanes, each 4 bytes wide, a data bandwidth of 256 GB/s is required at a logic frequency of 1 GHz. This matches well with the 320 GB/s bandwidth of the 3D-stacked memory. To overlap computation and communication, the input buffers are double-buffered; each input buffer instance is 512 KB (2 MB in total) so that it can fully hold the weights of a convolutional layer, while the output buffer has a capacity of 128 KB. The area of the input and output buffers was estimated with CACTI 6.5 to be 3.19 mm² at 28 nm (an area overhead of 4.7%).

A vector processor adds programmability to the NDP computation unit for pre- and post-processing of matrix multiplication, such as activation (ReLU), pooling, and simple addition between feature maps (for example, residual connections or join operations); its instruction sequence may be loaded from the task descriptor. This embodiment may use a vector processor based on scratch-pad memory, which loads input data directly from, and stores output data directly to, the scratch-pad memory. Neural network layers usually have large input and output data and perform static, regular computation. The pre- and post-processing of convolution mostly operates on streaming data (one operation per data item), such as applying ReLU or pooling to an output feature map. Given these characteristics (wide, aligned input data), scratch-pad memory can efficiently support wide vector processing units. The scratch-pad memory of the vector processor may also be double-buffered (for example, 512 KB per buffer) so that the DMA can store output data to DRAM and preload input data.
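The bandwidth and buffer figures quoted above follow from simple arithmetic, reproduced here as a short Python check. The constant and function names are hypothetical, and the breakdown of the 2 MB input buffer total as two input sides times two buffers (double buffering) times 512 KB is an interpretation of the text, not something it states explicitly.

```python
LANES = 64                  # lanes on one side of the systolic array
BYTES_PER_LANE = 4          # FP32 operands
CLOCK_HZ = 1_000_000_000    # 1 GHz logic frequency
MEMORY_BW_GB_PER_S = 320    # stated bandwidth of the 3D-stacked memory


def required_bandwidth_gb_per_s(lanes=LANES, bytes_per_lane=BYTES_PER_LANE,
                                clock_hz=CLOCK_HZ):
    """Worst-case DRAM bandwidth when one input side is refetched every cycle."""
    return lanes * bytes_per_lane * clock_hz / 1e9


def input_buffer_total_kb(instance_kb=512, input_sides=2, double_buffered=True):
    """Total input buffer capacity (assumed: sides x buffers x instance size)."""
    return instance_kb * input_sides * (2 if double_buffered else 1)
```

`required_bandwidth_gb_per_s()` evaluates to 256.0, which stays below the 320 GB/s memory bandwidth, and `input_buffer_total_kb()` evaluates to 2048 KB, that is, 2 MB.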

C. Communication Unit

FIGS. 19 and 20 show the architectures of the two communication processing elements. As described above, the unicast communication logic used for peer-to-peer (P2P) tile transfer may include a transform unit and quantization/prediction logic. After activation prediction, the tile data may be packed (compressed) to remove unnecessary tile data before transmission. To reduce data movement in registers, a pointer-based shift register may be used instead of moving the data itself. Output tiles may be copied from the output buffer of the systolic array into free data buffers, each of which may be designated for a particular destination worker. The output activation map stores the results of activation prediction, and unnecessary data (inactive tiles) are compressed out by partially shifting the pointer registers. For tile collection, the data indicated by the pointer registers are packetized and sent to the destination worker. For scattering with zero skipping, skipped data are filled with zeros at the receiving worker by referring to the input activation map shared between the source and destination workers.

For collective communication of weights, a ring-based collective operation may be used in the proposed hardware architecture (FIG. 20). A message is divided into several chunks that are transmitted in parallel, which is referred to as pipelined transfer. Each chunk may be mapped to a single packet in the memory-centric network. To prevent a slow worker from blocking the entire ring and to maximize bandwidth, the collective operations of several messages may be processed concurrently. Chunks of the same message arrive in a predefined order, but chunks of different messages may arrive in any order. To handle chunks that arrive without a predefined order, multiple reduce blocks and communication buffers may be implemented, with a specific message assigned to each reduce block. When a chunk arrives from a link or from the local computation unit, the reduce block checks whether the chunk is already in its associated communication buffer and either updates the buffered chunk (if it exists) or stores the new chunk (if it does not). When the next worker is ready to receive the chunk, the updated chunk is sent to it. The next worker, or destination worker, is determined by the ring direction and the dynamic clustering configuration. After the reduction of weight gradients completes, all weights may be updated and broadcast before being stored in DRAM.
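To make the chunked ring collective concrete, the following Python sketch simulates a textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase) over lists of gradients. It is offered as an illustration of the general technique only: it is not claimed to be the exact protocol of the hardware in FIG. 20, it handles a single message, and the function name and data layout are hypothetical.

```python
def ring_allreduce(vectors):
    """Sum equal-length vectors held by n workers arranged in a ring.

    Worker w sends to worker (w + 1) % n. Each vector is split into n chunks
    so that every link carries one chunk per step.
    """
    n = len(vectors)
    length = len(vectors[0])
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    data = [list(v) for v in vectors]

    # Phase 1: reduce-scatter. After n - 1 steps, worker w holds the fully
    # reduced chunk (w + 1) % n.
    for step in range(n - 1):
        sends = []
        for w in range(n):
            lo, hi = bounds[(w - step) % n]
            sends.append(((w - step) % n, data[w][lo:hi]))
        for w in range(n):
            chunk, payload = sends[(w - 1) % n]
            lo, _ = bounds[chunk]
            for i, value in enumerate(payload):
                data[w][lo + i] += value  # update the buffered chunk

    # Phase 2: all-gather. Reduced chunks circulate until every worker has all.
    for step in range(n - 1):
        sends = []
        for w in range(n):
            lo, hi = bounds[(w + 1 - step) % n]
            sends.append(((w + 1 - step) % n, data[w][lo:hi]))
        for w in range(n):
            chunk, payload = sends[(w - 1) % n]
            lo, hi = bounds[chunk]
            data[w][lo:hi] = payload
    return data
```

Each worker sends and receives one chunk per step, so link bandwidth is used evenly around the ring; processing several messages concurrently, as described above, would correspond to running several such exchanges interleaved.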

FIG. 21 is a block diagram illustrating an example of a computer device according to an embodiment of the present invention. The multi-dimensional parallel training method according to this embodiment may be performed by at least one computer device, each of which may have the internal configuration of the computer device 2100 shown in FIG. 21.

As shown in FIG. 21, the computer device 2100 may include a memory 2110, a processor 2120, a communication interface 2130, and an input/output interface 2140. The memory 2110 is a computer-readable recording medium and may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive. A permanent mass storage device such as a ROM or a disk drive may also be included in the computer device 2100 as a separate permanent storage device distinct from the memory 2110. An operating system and at least one program code may be stored in the memory 2110. These software components may be loaded into the memory 2110 from a computer-readable recording medium separate from the memory 2110. Such a separate computer-readable recording medium may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, or the like. In other embodiments, the software components may be loaded into the memory 2110 through the communication interface 2130 rather than from a computer-readable recording medium. For example, the software components may be loaded into the memory 2110 of the computer device 2100 based on a computer program installed from files received over the network 2160.

The processor 2120 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processor 2120 by the memory 2110 or the communication interface 2130. For example, the processor 2120 may be configured to execute received instructions according to program code stored in a recording device such as the memory 2110.

The communication interface 2130 may provide a function for the computer device 2100 to communicate with other devices through the network 2160. For example, a request, command, data, or file generated by the processor 2120 of the computer device 2100 according to program code stored in a recording device such as the memory 2110 may be transferred to other devices through the network 2160 under the control of the communication interface 2130. Conversely, signals, commands, data, files, and the like from other devices may be received by the computer device 2100 through the communication interface 2130 via the network 2160. Signals, commands, data, and the like received through the communication interface 2130 may be passed to the processor 2120 or the memory 2110, and files and the like may be stored in a storage medium (the permanent storage device described above) that the computer device 2100 may further include.

The input/output interface 2140 may be a means for interfacing with an input/output device 2150. For example, input devices may include a microphone, a keyboard, or a mouse, and output devices may include a display or a speaker. As another example, the input/output interface 2140 may be a means for interfacing with a device in which input and output functions are integrated, such as a touch screen. The input/output device 2150 may be integrated with the computer device 2100 into a single device.

In other embodiments, the computer device 2100 may include fewer or more components than those shown in FIG. 21. However, most conventional components need not be explicitly illustrated. For example, the computer device 2100 may be implemented to include at least some of the input/output devices 2150 described above, or may further include other components such as a transceiver or a database.

The communication method is not limited and may include not only methods that use a communication network that the network 2160 may include (for example, a mobile communication network, the wired Internet, the wireless Internet, or a broadcast network) but also short-range wireless communication such as Bluetooth or near field communication (NFC). For example, the network 2160 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), and the Internet. The network 2160 may also include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited thereto.

FIG. 22 is a flowchart illustrating an example of a data processing method according to an embodiment of the present invention. The data processing method according to this embodiment may be performed, for example, by the computer device 2100 described above. For example, the processor 2120 of the computer device 2100 may be implemented to execute control instructions according to the code of the operating system or the code of at least one program contained in the memory 2110. Here, the processor 2120 may control the computer device 2100 so that it performs steps 2210 to 2240 of the method of FIG. 22 according to the control instructions provided by the code stored in the computer device 2100.

In step 2210, the computer device 2100 may convert input data including an input feature map into a plurality of tiles. For example, the computer device 2100 may convert the input feature map into t tiles of size T × T, generating a total of B × t tiles per input channel.
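The tiling in step 2210 can be illustrated with a short Python sketch. The relations used here are the standard ones for an F(m × m, r × r) Winograd convolution and are assumptions as far as this passage is concerned: tile size T = m + r - 1, neighbouring tiles overlapping by r - 1 elements, and no padding of the input except zero fill at the borders. The function names are hypothetical, B is taken to be the batch size, and the Winograd input transform applied to each tile afterwards is omitted.

```python
import math


def tile_size(m, r):
    """Input tile edge T for F(m x m, r x r)."""
    return m + r - 1


def tiles_per_feature_map(height, width, m, r):
    """Number of tiles t needed to cover one feature map."""
    out_h, out_w = height - r + 1, width - r + 1
    return math.ceil(out_h / m) * math.ceil(out_w / m)


def extract_tiles(fmap, m, r):
    """Cut a 2D feature map into overlapping T x T tiles (stride m)."""
    t_edge = tile_size(m, r)
    h, w = len(fmap), len(fmap[0])
    tiles = []
    for top in range(0, h - r + 1, m):
        for left in range(0, w - r + 1, m):
            tiles.append([[fmap[y][x] if y < h and x < w else 0
                           for x in range(left, left + t_edge)]
                          for y in range(top, top + t_edge)])
    return tiles
```

For a 6 × 6 feature map with F(2 × 2, 3 × 3), this gives T = 4 and t = 4, so a batch of B = 8 images yields 32 tiles per input channel.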

In step 2220, the computer device 2100 may distribute the plurality of converted tiles to a plurality of workers arranged in two dimensions by a plurality of clusters and a plurality of groups, distributing the converted tiles cluster by cluster. The workers in each group are interconnected through a ring topology that supports collective operations on weights, and the workers in each cluster are interconnected through a high-connectivity topology that efficiently supports all-to-all traffic for tile collection and scattering. For example, the high-connectivity topology may include a 2D flattened butterfly (FBFLY) topology or a Dragonfly topology.

Depending on the embodiment, the computer device 2100 may dynamically change the number of clusters and the number of groups, through a host that connects the plurality of groups to one another, according to at least one of the size of the input feature map and the size of the weights. For example, the host may correspond to the 'Host CPU' shown in FIG. 13(a).
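The 2D arrangement of workers and its reconfiguration can be sketched as an index mapping. The numbering scheme below (worker id = cluster index × number of groups + group index) and the function names are assumptions for illustration. Following the description above, workers that share a group index form a ring across clusters, and workers that share a cluster index are all-to-all peers.

```python
def worker_position(worker, num_groups):
    """Return (cluster, group) coordinates of a worker id."""
    return divmod(worker, num_groups)


def ring_next(worker, num_clusters, num_groups):
    """Next worker on the ring of the same group (weight collectives)."""
    cluster, group = divmod(worker, num_groups)
    return ((cluster + 1) % num_clusters) * num_groups + group


def cluster_peers(worker, num_groups):
    """Other workers in the same cluster (all-to-all tile transfer)."""
    cluster = worker // num_groups
    base = cluster * num_groups
    return [base + g for g in range(num_groups) if base + g != worker]
```

Dynamic clustering then amounts to evaluating the same functions with a different split of the same worker count: with 8 workers, worker 5 sits at cluster 1, group 1 in a 2 × 4 arrangement and at cluster 2, group 1 in a 4 × 2 arrangement.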

In step 2230, the computer device 2100 may perform data parallelism on the distributed input data through the workers of each of the plurality of clusters.

In step 2240, the computer device 2100 may perform, through the workers of each of the plurality of groups, intra-tile parallelism on the element units of the tiles applied to each group among the distributed input data. For example, the computer device 2100 may perform the intra-tile parallelism through element-wise dot products, in the Winograd domain, of the elements located at the same position in each of the plurality of tiles.
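Intra-tile parallelism relies on the fact that, in the Winograd domain, each tile element position is computed independently of the others. The Python sketch below models this with flattened tiles; the round-robin assignment of element positions to groups, the data layout, and the function names are hypothetical choices for the example.

```python
def split_elements(num_elements, num_groups):
    """Assign tile element positions to groups (round-robin, assumed)."""
    return [list(range(g, num_elements, num_groups)) for g in range(num_groups)]


def elementwise_dot(weights, tiles, elements):
    """One group's work: dot product over input channels per element position.

    weights[k][c][e]: transformed weight for output channel k, input channel c.
    tiles[c][e]:      transformed input tile for input channel c.
    """
    return {e: [sum(w_k[c][e] * tiles[c][e] for c in range(len(tiles)))
                for w_k in weights]
            for e in elements}


def intra_tile_parallel(weights, tiles, num_groups):
    """Combine the per-group results into output tiles indexed [k][e]."""
    num_elements = len(tiles[0])
    partial = {}
    for elements in split_elements(num_elements, num_groups):
        partial.update(elementwise_dot(weights, tiles, elements))
    return [[partial[e][k] for e in range(num_elements)]
            for k in range(len(weights))]
```

The result is the same for any number of groups, which is what allows the group count to be chosen freely; only the element positions each worker must hold, and therefore the tile data it must exchange, change.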

Here, when a source worker among the plurality of workers transmits tiles to a destination worker, it may transmit quantized values of the tile elements for prediction before transmitting the actual values, and the destination worker may predict the activation of spatial-domain neurons from the quantized values so that unnecessary tile collection is skipped.

In addition, each of the plurality of workers may be implemented with a scalable near-data processing (NDP) architecture including a control unit, a computation unit, and a communication unit. For example, each worker may include: a control unit that stores the task graph created by the host, loads tasks in a predefined order based on the task graph, and controls update counters used for dependency checking; a computation unit including a systolic array for matrix multiplication and a vector processor that adds programmability for pre- and post-processing of the matrix multiplication; and a communication unit including unicast communication logic and ring-based collective communication logic.

As described above, embodiments of the present invention can provide a new multi-dimensional parallel training (MPT) scheme for Winograd-domain convolution that exploits both data parallelism and the intra-tile (element-wise) parallelism available in the Winograd domain. To efficiently support communication between workers under MPT, they can provide a 2D organization of workers with a hybrid topology that uses a ring topology for the collective communication of data parallelism and a high-connectivity topology for tile transfer. They can also provide dynamic clustering (reconfiguration of the connections between workers), which allows the MPT organization of workers to optimize the balance between the two types of communication (weight gradients and tile transfer). Finally, to reduce the bandwidth required for tile collection, they can provide prediction of the activation of spatial-domain neurons from quantized values of Winograd-domain data without loss of accuracy.

The system or device described above may be implemented as hardware components or as a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person having ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device, so as to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The medium may persistently store a computer-executable program or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and other devices configured to store program instructions. Other examples include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A multi-dimensional parallel training method for Winograd transform convolution, the method comprising:
converting input data into a plurality of tiles and distributing the plurality of tiles, cluster by cluster, to a plurality of workers arranged in two dimensions as a plurality of clusters and a plurality of groups;
performing data parallelism on the distributed input data through the workers of each of the plurality of clusters; and
performing intra-tile parallelism, through the workers of each of the plurality of groups, on the elements of those tiles of the distributed input data that are applied to each of the plurality of groups.

The method of claim 1,
wherein performing the intra-tile parallelism comprises
performing, in the Winograd-domain dot product, a dot product of the elements located at the same position in each of the plurality of tiles.
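The element-wise structure that makes intra-tile parallelism possible can be sketched with NumPy. The example below assumes the widely used F(2x2, 3x3) Winograd transform with its standard published matrices; the tile size, the round-robin assignment of the 16 element positions to groups, and the names winograd_tile and direct_tile are illustrative assumptions, not details taken from the claims. Each element position (i, j) needs only the values at that same position in the transformed input and weight tiles, so different groups can compute different positions independently.

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transform matrices (assumed tile size).
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)


def winograd_tile(d, g, num_groups=4):
    """d: (C, 4, 4) input tile, g: (K, C, 3, 3) filters -> (K, 2, 2) output."""
    V = BT @ d @ BT.T   # transformed input tile, shape (C, 4, 4)
    U = G @ g @ G.T     # transformed filters, shape (K, C, 4, 4)
    M = np.zeros((g.shape[0], 4, 4))
    positions = [(i, j) for i in range(4) for j in range(4)]
    # Hypothetical round-robin split of element positions across groups.
    # The groups are simulated one after another here; in a real system
    # each group would compute its share concurrently.
    for grp in range(num_groups):
        for i, j in positions[grp::num_groups]:
            # Dot product over input channels at one element position.
            M[:, i, j] = U[:, :, i, j] @ V[:, i, j]
    return AT @ M @ AT.T  # inverse transform back to the spatial domain


def direct_tile(d, g):
    """Reference: direct 3x3 correlation producing the same 2x2 outputs."""
    out = np.zeros((g.shape[0], 2, 2))
    for y in range(2):
        for x in range(2):
            out[:, y, x] = np.einsum("cpq,kcpq->k", d[:, y:y + 3, x:x + 3], g)
    return out
```

Because the inverse transform needs all 16 positions of a tile, the partial results held by different groups must be gathered before it is applied, which is the tile traffic that the interconnect has to carry.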

The method of claim 1, wherein:
the workers included in each of the plurality of groups are interconnected through a ring topology that supports collective operations on weights; and
the workers included in each of the plurality of clusters are interconnected through a high-connectivity topology that efficiently supports all-to-all traffic for tile gathering and scattering.
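One common collective operation that maps naturally onto a ring is ring all-reduce, sketched below as a single-process simulation in plain Python. The claims do not name a specific collective algorithm, so the choice of ring all-reduce, the function name ring_allreduce, and the assumption that the vector length is divisible by the number of workers are all illustrative. Each worker only ever sends to its next neighbor on the ring.

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce (sum) over equal-length per-worker vectors.

    Assumes len(vectors[0]) is divisible by the number of workers.
    Returns the list of per-worker results; every entry equals the sum.
    """
    n = len(vectors)
    size = len(vectors[0]) // n
    data = [list(v) for v in vectors]

    def chunk(i):
        return range(i * size, (i + 1) * size)

    # Phase 1, reduce-scatter: in step s, worker r sends chunk (r - s) % n
    # to worker (r + 1) % n, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [[data[r][k] for k in chunk(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for k, v in zip(chunk(c), payload):
                data[dst][k] += v

    # Worker r now holds the complete sum for chunk (r + 1) % n.
    # Phase 2, all-gather: in step s, worker r forwards chunk (r + 1 - s) % n.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [[data[r][k] for k in chunk(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for k, v in zip(chunk(c), payload):
                data[dst][k] = v
    return data
```

Each worker transmits 2(n - 1) chunks in total, so the per-worker traffic stays close to twice the vector size regardless of the ring length.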

The method of claim 1, further comprising:
dynamically changing the number of clusters and the number of groups, through a host that connects the plurality of groups to each other, according to at least one of the size of the input feature map and the size of the weights.
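The trade-off behind this dynamic reconfiguration can be illustrated with a toy cost model. The sketch below is an assumption for illustration only: the function names, the byte-count parameters, and both cost formulas are hypothetical and are not taken from the disclosure. It enumerates every way to factor a fixed number of workers into clusters and groups and picks the split with the lowest estimated per-worker traffic.

```python
def factor_pairs(n):
    """All (num_clusters, num_groups) pairs whose product is n."""
    return [(c, n // c) for c in range(1, n + 1) if n % c == 0]


def choose_config(num_workers, feature_bytes, weight_bytes):
    """Pick (num_clusters, num_groups) minimizing a toy traffic estimate.

    Hypothetical model, per worker:
      - weight gradients: ring all-reduce across the num_clusters workers
        of a group, over that group's 1/num_groups share of the weights;
      - tiles: all-to-all exchange among the num_groups workers of a
        cluster, over that cluster's 1/num_clusters share of the tiles.
    """
    def cost(nc, ng):
        grad = 2 * (nc - 1) / nc * weight_bytes / ng
        tile = (ng - 1) / ng * feature_bytes / nc
        return grad + tile

    return min(factor_pairs(num_workers), key=lambda p: cost(*p))
```

Under this toy model, a layer dominated by weights favors more groups (more intra-tile parallelism), while a layer dominated by feature-map data favors more clusters (more data parallelism).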

The method of claim 1, wherein,
among the plurality of workers, a source worker transmits a quantized value of a tile element for prediction before transmitting the actual value to a destination worker, and the destination worker predicts the activation of spatial-domain neurons from the quantized value, thereby skipping unnecessary tile gathering.
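One way such a prediction could avoid any loss of accuracy is a conservative error bound, sketched below with NumPy. This is an assumed scheme, not the mechanism recited in the claim: it assumes a ReLU activation, the F(2x2, 3x3) inverse transform matrix, uniform rounding with a hypothetical step size, and the illustrative names quantize and can_skip_tile. Because the inverse transform is linear, the rounding error of each output is bounded, so a tile is skipped only when every output is guaranteed to be non-positive.

```python
import numpy as np

# Inverse-transform matrix of the F(2x2, 3x3) Winograd algorithm (assumed).
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)


def quantize(M, step):
    """Uniform rounding; each element changes by at most step / 2."""
    return np.round(M / step) * step


def can_skip_tile(Mq, step):
    """True only if every spatial output is provably <= 0 (ReLU gives 0).

    Mq is a quantized 4x4 Winograd-domain tile. The worst-case error of
    each spatial output is |AT| @ (step / 2) @ |AT|^T by linearity.
    """
    y_q = AT @ Mq @ AT.T
    err = np.abs(AT) @ np.full((4, 4), step / 2) @ np.abs(AT).T
    return bool(np.all(y_q + err <= 0))
```

When can_skip_tile returns False, the destination worker would still request the full-precision values, so the result matches the unquantized computation in either case.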

The method of claim 1, wherein
each of the plurality of workers is implemented in a scalable near-data processing (NDP) architecture including a control unit, a computation unit, and a communication unit.

A computer program stored on a computer-readable recording medium which, in combination with a computer device, causes the computer device to execute the method of any one of claims 1 to 6.

A computer-readable recording medium on which is recorded a computer program for causing a computer device to execute the method of any one of claims 1 to 6.

A computer device for performing a multi-dimensional parallel training (MPT) method for Winograd transform convolution, the computer device comprising:
at least one processor configured to execute computer-readable instructions,
wherein the at least one processor is configured to:
convert input data into a plurality of tiles and distribute the plurality of tiles, cluster by cluster, to a plurality of workers arranged in two dimensions as a plurality of clusters and a plurality of groups;
perform data parallelism on the distributed input data through the workers of each of the plurality of clusters; and
perform intra-tile parallelism, through the workers of each of the plurality of groups, on the elements of those tiles of the distributed input data that are applied to each of the plurality of groups.

The computer device of claim 9,
wherein each of the plurality of workers comprises:
a control unit that stores a task graph created by a host, loads tasks in a predefined order based on the task graph, and controls an update counter based on dependency checking;
a computation unit including a systolic array for matrix multiplication and a vector processor that adds programmability for pre-processing and post-processing of the matrix multiplication; and
a communication unit including unicast communication logic and ring-based collective communication logic.