KR102335955B1

KR102335955B1 - Convolution neural network system and operation method thereof

Info

Publication number: KR102335955B1
Application number: KR1020170028471A
Authority: KR
Inventors: 김진규; 김병조; 김성민; 김주엽; 이미영; 이주현
Original assignee: 한국전자통신연구원
Priority date: 2016-11-07
Filing date: 2017-03-06
Publication date: 2021-12-08
Also published as: KR20180052063A

Abstract

A convolutional neural network system according to an embodiment of the present invention includes a data selector configured to output, based on a sparse index indicating the position of a non-zero value in a sparse weight kernel, an input value that corresponds to the position of the sparse weight among the input values of input data, and a multiply-accumulate (MAC) operator configured to perform a convolution operation on the input value output from the data selector using the sparse weight kernel.

Description

Convolutional neural network system and its operation method

The present invention relates to deep neural networks and, more particularly, to a convolutional neural network system and an operating method thereof.

Recently, the convolutional neural network (hereinafter, CNN), one of the deep neural network techniques, has been actively studied as a technology for image recognition. Neural network structures show excellent performance in various recognition fields such as object recognition and handwriting recognition. In particular, CNNs provide highly effective performance for object recognition.

With efficient CNN structures having been proposed in recent years, the recognition rate achieved with neural networks has reached a level close to that of human recognition. However, because the CNN structure is very complex, a large amount of computation is required, and hardware acceleration using a high-performance server or a GPU is therefore used. In the CNN structure, most internal operations are performed using multiply-accumulators (MACs). Because the number of connections between nodes in a CNN is very large and the number of parameters requiring multiplication is large, a large amount of computation is required in the learning or recognition process, which in turn requires large hardware resources.

An object of the present invention is to solve the above-described technical problem by providing a convolutional neural network system, and an operating method thereof, capable of reducing the convolution operations in a convolutional neural network based on sparse weights generated by a neural network compression technique.

Another object of the present invention is to provide an effective computation method and apparatus for a convolutional neural network system that uses sparse weights, thereby shortening the computation time and improving overall performance.

A convolutional neural network system according to an embodiment of the present invention includes a data selector configured to output, based on a sparse index indicating the position of a non-zero value in a sparse weight kernel, an input value that corresponds to the position of the sparse weight among the input values of input data, and a multiply-accumulate (MAC) operator configured to perform a convolution operation on the input value output from the data selector using the sparse weight kernel, wherein the sparse weight kernel includes at least one weight value of '0'.

In an embodiment, the data selector is configured not to output an input value that corresponds to the position of a '0' value in the sparse weight kernel.

In an embodiment, the convolutional neural network system further includes an input buffer device configured to store an input tile, which is a part of the input data, from an external memory, and an output buffer device configured to store a result value of the convolution operation from the MAC operator and to provide the stored result value to the external memory.

In an embodiment, the convolutional neural network system further includes a weight kernel buffer device configured to receive the sparse weight kernel from an external memory, provide the received sparse weight kernel to the MAC operator, and provide the sparse index of the sparse weight kernel to the data selector.

In an embodiment, the data selector includes a switch circuit and a plurality of multiplexers (MUXs). The switch circuit is configured to provide each of the input values to the plurality of MUXs based on the sparse weight kernel, and each of the plurality of MUXs is configured to select and output, based on the sparse index, the input value corresponding to the position of the sparse weight among the input values provided by the switch circuit.

In an embodiment, the MAC operator includes a plurality of MAC cores, each configured to receive the input value output from a respective one of the plurality of MUXs and to perform a convolution operation on the received input value based on the sparse weight kernel.

In an embodiment, each of the plurality of MAC cores includes a multiplier configured to perform a multiplication operation on the input value and the sparse weight, an adder configured to perform an addition operation on the result of the multiplication operation and the result of a previous addition operation, and a register configured to store the result of the addition operation.

In an embodiment, the sparse weight kernel is a weight kernel converted from a full weight kernel through neural network compression, and the full weight kernel consists of non-zero weight values.

In an embodiment, the neural network compression is performed based on at least one of a parameter removal technique, a weight sharing technique, or a parameter quantization technique applied to the full weight kernel.

A convolutional neural network system according to an embodiment of the present invention includes an input buffer device configured to receive an input tile including a plurality of input values from an external memory and to store the plurality of input values of the received input tile, a data selector configured to output at least one input value among the plurality of input values from the input buffer device based on a sparse index indicating the position of a non-zero sparse weight in a sparse weight kernel, a multiply-accumulate (MAC) operator configured to perform a convolution operation based on the sparse weight and the at least one input value output from the data selector, and an output buffer device configured to store a result value of the convolution operation from the MAC operator and to provide the stored result value to the external memory as an output tile.

In an embodiment, the data selector includes a switch circuit and a plurality of multiplexers (MUXs). The switch circuit is configured to connect each of the plurality of input values to each of the plurality of MUXs based on the size of the input tile and the sparse weight kernel, and each of the plurality of MUXs is configured to select and output, based on the sparse index, the at least one input value corresponding to the position of the sparse weight among the connected input values.

In an embodiment, each of the plurality of MUXs does not output an input value corresponding to the position of a '0' weight in the sparse weight kernel.

In an embodiment, the at least one input value from each of the plurality of MUXs is an input value corresponding to the position of the sparse weight.

In an embodiment, when the sparse weight kernel has a size of K×K (where K is a natural number), the switch circuit is configured to connect 2K input values to each of the plurality of MUXs.

In an embodiment, the MAC operator includes a plurality of MAC cores, each configured to perform the convolution operation based on the sparse weight kernel and the at least one input value from a respective one of the plurality of MUXs.

A method of operating a convolutional neural network system according to an embodiment of the present invention includes storing an input tile that is a part of input data, connecting each of the input values of the input tile to each of a plurality of multiplexers (MUXs) based on a sparse weight kernel, selecting, in each of the plurality of MUXs, at least one of the connected input values based on a sparse index for the sparse weight kernel, performing a convolution operation on the selected at least one input value using the sparse weight kernel, accumulating the result of the convolution operation, and providing the accumulated result to an external memory as an output tile.

In an embodiment, selecting, in each of the plurality of MUXs, at least one of the connected input values based on the sparse index for the sparse weight kernel includes selecting the input values corresponding to the positions of non-zero weights in the sparse weight kernel and not selecting the input values corresponding to the positions of '0' weights in the sparse weight kernel.
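The selection step described above can be illustrated in software. The following C sketch is not the claimed hardware; it assumes a kernel flattened into a one-dimensional array and models the sparse index as a list of the positions of non-zero weights. The function names `build_sparse_index` and `sparse_window_mac` are hypothetical and do not appear in the specification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: records the positions of non-zero weights in a
 * flattened kernel of n entries. Returns how many positions were stored.
 * The caller must provide an index array with room for n entries. */
static size_t build_sparse_index(const int *kernel, size_t n, size_t *index) {
    assert(kernel != NULL && index != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (kernel[i] != 0) {
            index[count++] = i;
        }
    }
    return count;
}

/* Hypothetical helper: multiply-accumulate over one input window, visiting
 * only the positions listed in the sparse index. Inputs that line up with
 * '0' weights are never read, which models "not selecting" them. */
static int sparse_window_mac(const int *window, const int *kernel,
                             const size_t *index, size_t count) {
    assert(window != NULL && kernel != NULL && index != NULL);
    int acc = 0;
    for (size_t j = 0; j < count; j++) {
        size_t pos = index[j];
        acc += window[pos] * kernel[pos];
    }
    return acc;
}
```

For a 3×3 kernel with three non-zero weights, the loop in `sparse_window_mac` runs three times instead of nine; this is the reduction in multiplications that the sparse index is meant to provide.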

According to the present invention, a convolutional neural network system is provided that more effectively performs the operations of a convolutional neural network algorithm using parameters (for example, weight kernels) composed of sparse matrices.

The convolutional neural network system according to the present invention can selectively perform convolution operations on input data based on a sparse matrix. Because the system therefore achieves an effective computation flow on small hardware, the overall computational efficiency of the convolutional neural network system is improved.

In addition, the convolutional neural network system according to the present invention can provide an effective hardware structure for processing sparse weight kernels. Because hardware is generally best implemented as a regular array and operated repetitively, the convolutional neural network system according to the present invention can operate the hardware engine effectively while maintaining the regularity of the hardware array.

FIG. 1 is a diagram exemplarily showing layers implemented in a convolutional neural network (CNN) according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the operation of a convolutional layer in the CNN of FIG. 1.
FIG. 3 is a block diagram exemplarily showing a hardware configuration for implementing a CNN system that performs a partial convolution operation.
FIG. 4 is a diagram for explaining the convolution operation of the CNN system of FIG. 3.
FIG. 5 is a diagram exemplarily showing a sparse weight kernel of the present invention.
FIG. 6 is a block diagram showing a hardware configuration of a CNN system according to an embodiment of the present invention.
FIG. 7 is a block diagram showing the CNN system of FIG. 6 in more detail.
FIGS. 8 and 9 are diagrams for explaining the operation of the CNN system of FIG. 7 in more detail.
FIG. 10 is a flowchart briefly showing the operation of the convolutional neural network system according to the present invention.

Hereinafter, embodiments of the present invention are described clearly and in detail so that those of ordinary skill in the art can readily practice the present invention.

In general, a convolution operation refers to an operation for detecting the correlation between two functions. The term 'convolutional neural network (hereinafter, CNN)' may collectively refer to a process or system that determines the pattern of an image or extracts features of an image by repeatedly performing convolution operations between input data, or a kernel representing a specific feature, and specific parameters (for example, weights and biases).

Hereinafter, a value provided to the CNN system for a specific operation, or generated or output as a result of a specific operation, is referred to as data. Data may refer to an image input to the CNN system, or to a feature map or specific values generated in a specific layer within the CNN system.

In addition, filters, windows, masks, and the like used in signal processing of input data (for example, a convolution operation) are collectively referred to as kernels. Furthermore, in the following detailed description, functions, configurations, circuits, systems, or operations well known to those skilled in the art are omitted in order to describe the embodiments of the present invention clearly and to avoid ambiguity.

The functional blocks used in the detailed description or the drawings may be implemented in software, hardware, or a combination thereof. The software may be machine code, firmware, embedded code, or application software, and the hardware may be a circuit, a processor, a computer, an integrated circuit, integrated circuit cores, a pressure sensor, an inertial sensor, a microelectromechanical system (MEMS), passive components, or a combination thereof.

FIG. 1 is a diagram exemplarily showing layers implemented in a convolutional neural network according to an embodiment of the present invention. Referring to FIG. 1, a convolutional neural network (hereinafter, "CNN") 10 may process input data through various operations in various layers (for example, convolution operations and sub-sampling) and output the result to a fully connected layer.

For example, assume that first data D1 is the input data supplied to the CNN 10 and is a gray image having a size of 1×28×28 pixels. That is, the channel depth of the first data D1 is '1'. When the first data D1 is input to the CNN 10, a first layer L1 may perform a convolution operation on the first data D1 using a first kernel K1 to output or generate second data D2. For example, the first layer L1 may be a convolutional layer. When the first kernel K1 has a size of 5×5 and the convolution operation is performed without data padding in the edge region of the first data D1, the second data D2 has a size of 24×24 and has 20 channels. That is, the second data D2 may be output with a size of 24×24×20 (data width × data height × channel).

Thereafter, a second layer L2 may perform a pooling operation on the second data D2 to output or generate third data D3. For example, the second layer L2 may be a pooling layer. The pooling operation in the second layer L2 refers to an operation that adjusts the width and height of each channel of the second data D2 in the spatial domain while maintaining the number of channels. As a more detailed example, when the second layer L2 performs the pooling operation using a second kernel K2 having a size of 2×2, the third data D3 generated by the second layer L2 has a size of 12×12 and has 20 channels. That is, the third data D3 may be output with a size of 20×12×12 (channel × data width × data height).

Thereafter, a third layer L3 may perform a convolution operation on the third data D3 using a third kernel K3 to output or generate fourth data D4. A fourth layer L4 may then perform a pooling operation on the fourth data D4 using a fourth kernel K4 to output or generate fifth data D5. In this case, the fourth data D4 may be output with a size of 50×8×8 (channel × data width × data height), and the fifth data D5 may be output with a size of 50×4×4 (channel × data width × data height). For example, the third and fourth layers L3 and L4 may be a convolutional layer and a pooling layer, respectively, and may perform operations similar to those of the first and second layers L1 and L2. The operations of the first to fourth layers L1 to L4 may be performed repeatedly until a specific condition is satisfied.
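The data sizes quoted for D1 through D5 can be checked with a short calculation. The C sketch below assumes a stride of 1 for the convolutions and non-overlapping windows for the pooling operations, and it assumes that the third kernel K3 is 5×5 like K1 and the fourth kernel K4 is 2×2 like K2; the text does not state these values explicitly, but they are consistent with the sizes given. The function names are hypothetical.

```c
#include <assert.h>

/* Width (or height) after a stride-1 convolution without padding:
 * a kernel of size k fits into in_size positions (in_size - k + 1) times. */
static int conv_out_size(int in_size, int kernel_size) {
    assert(kernel_size >= 1 && in_size >= kernel_size);
    return in_size - kernel_size + 1;
}

/* Width (or height) after pooling with non-overlapping windows
 * (assumed: the stride equals the window size). */
static int pool_out_size(int in_size, int window) {
    assert(window >= 1 && in_size % window == 0);
    return in_size / window;
}
```

Under these assumptions the spatial size goes 28 → 24 → 12 → 8 → 4 through the layers L1 to L4, which agrees with the sizes of D2 through D5 in the description.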

A fifth layer L5 may perform a fully connected network (FCN) operation on the fifth data D5 to output fully connected data 20. Unlike the convolutional layers of the first and third layers L1 and L3, the fifth layer L5, which is a fully connected layer, does not use a kernel, and every node of its input data is connected to every node of its output data.

The layers L1 to L5 of the CNN 10 shown in FIG. 1 are simplified, and an actual CNN 10 may include more layers.

The number of parameters and the number of connections in each of the layers L1 to L5 of FIG. 1 may be as shown in Table 1. The numerical values in Table 1 are based on the data sizes shown in FIG. 1 and are exemplary.

Layer                    First layer (L1)       Third layer (L3)       Fifth layer (L5)
                         convolutional layer    convolutional layer    fully connected layer
Number of weights        500                    25,000                 400,000
Number of biases         20                     50                     500
Number of connections    299,520                1,603,200              400,500

Referring to Table 1, the number of weights in each layer is {number of output channels * number of input channels * kernel height * kernel width}. That is, for the first layer L1, the number of output channels is 20, the number of input channels is 1, the kernel height is 5, and the kernel width is 5, so the number of weights used in the first layer L1 is 20*1*5*5=500. Similarly, the number of weights used in the third layer L3 is 25,000, and the number of weights used in the fifth layer L5 is 400,000.

The number of biases in each layer is {number of output channels}. That is, since the number of output channels of the first layer L1 is 20, the number of biases used in the first layer L1 is 20. Similarly, the number of biases used in the third layer L3 is 50, and the number of biases used in the fifth layer L5 is 500.

The number of connections in each layer is {number of parameters * output data height * output data width}, where the number of parameters is the sum of the number of weights and the number of biases. That is, for the first layer L1, the number of parameters is 520, the output data height is 24, and the output data width is 24, so the number of connections of the first layer L1 is 520*24*24=299,520. Similarly, the number of connections of the third layer L3 is 1,603,200, and the number of connections of the fifth layer L5 is 400,500.
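The three formulas above can be checked against Table 1 with a small C sketch. The struct and function names are hypothetical. The shape used for the fifth layer L5 (500 outputs, 50 input channels, a 4×4 spatial extent, and a 1×1 output) is an assumption chosen because it reproduces the 400,000 weights and 400,500 connections in Table 1; the text itself says the fully connected layer does not use a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one layer's shape. */
typedef struct {
    long out_ch;  /* number of output channels */
    long in_ch;   /* number of input channels  */
    long k_h;     /* kernel height             */
    long k_w;     /* kernel width              */
    long out_h;   /* output data height        */
    long out_w;   /* output data width         */
} layer_shape_t;

/* weights = output channels * input channels * kernel height * kernel width */
static long weight_count(const layer_shape_t *s) {
    assert(s != NULL);
    return s->out_ch * s->in_ch * s->k_h * s->k_w;
}

/* biases = output channels */
static long bias_count(const layer_shape_t *s) {
    assert(s != NULL);
    return s->out_ch;
}

/* connections = (weights + biases) * output height * output width */
static long connection_count(const layer_shape_t *s) {
    return (weight_count(s) + bias_count(s)) * s->out_h * s->out_w;
}
```

For the third layer L3, for example, the shape {50, 20, 5, 5, 8, 8} gives 25,000 weights, 50 biases, and (25,000 + 50) * 8 * 8 = 1,603,200 connections.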

As shown in Table 1, the convolutional layers (for example, L1 and L3) have fewer parameters than the fully connected layer (for example, L5). However, because some convolutional layers (for example, L3) have more connections than the fully connected layer (for example, L5), those convolutional layers require a larger amount of computation. Various methods are being developed to reduce the amount of computation in such convolutional layers.

Similar to the above description, a neural network may include an input layer, a hidden layer, and an output layer. The input layer is configured to receive input data for learning and pass it to the hidden layer, and the output layer is configured to generate the output of the neural network based on data from the hidden layer. The hidden layer may transform the input data received through the input layer into values that are easier to predict. Nodes in the input layer and the hidden layer are connected to each other through weights, and nodes in the hidden layer and the output layer may likewise be connected to each other through weights.

In a neural network, the computational throughput between the input layer and the hidden layer may be determined by the number or size of the input and output data. In addition, as the depth of each layer increases, the size of the weights and the computational throughput associated with the input and output layers may increase rapidly. Accordingly, a method or apparatus for reducing the size of these parameters may be required in order to implement a neural network in hardware.

For example, neural network compression may be used to reduce the size of the parameters. Neural network compression may include a parameter removal (drop-out) technique, a weight sharing technique, a quantization technique, and the like. The parameter removal technique removes parameters with low weight values from among the parameters inside the neural network. The weight sharing technique reduces the number of parameters to be processed by sharing parameters that have similar weights. The quantization technique quantizes the bit sizes of the weights and of the input/output and hidden layers in order to reduce the number of parameters. The data, kernels, and connection parameters of each layer of the CNN 10 have been briefly described above.
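As an illustration of how parameter removal can turn a full weight kernel into a sparse one, the C sketch below zeroes every weight whose magnitude falls below a threshold. The specification does not define the removal criterion beyond "low weight", so the magnitude threshold and the function name `prune_kernel` are assumptions made only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical magnitude-based parameter removal: sets every weight whose
 * absolute value is below `threshold` to zero, in place.
 * Returns the number of weights that remain non-zero. */
static size_t prune_kernel(float *kernel, size_t n, float threshold) {
    assert(kernel != NULL && threshold >= 0.0f);
    size_t remaining = 0;
    for (size_t i = 0; i < n; i++) {
        float magnitude = kernel[i] < 0.0f ? -kernel[i] : kernel[i];
        if (magnitude < threshold) {
            kernel[i] = 0.0f;   /* removed parameter */
        } else if (magnitude > 0.0f) {
            remaining++;        /* parameter is kept */
        }
    }
    return remaining;
}
```

The resulting kernel contains at least one '0' weight whenever any parameter is removed, which matches the sparse weight kernel described in the summary.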

도 2는 도 1의 CNN에서의 컨볼루션 계층의 동작을 설명하기 위한 도면이다. 간결한 설명을 위하여, CNN(10)의 컨볼루션 계층을 설명하는데 불필요한 구성 요소들은 생략된다. 또한, 컨볼루션 계층은 도 1의 제1 계층(L1)인 것으로 가정한다.FIG. 2 is a diagram for explaining an operation of a convolutional layer in the CNN of FIG. 1 . For concise description, components unnecessary to describe the convolutional layer of the CNN 10 are omitted. Also, it is assumed that the convolutional layer is the first layer L1 of FIG. 1 .

도 1 및 도 2를 참조하면, 입력 데이터(Din)는 N×W×H의 크기를 갖고, 입력 데이터(Din)에 대하여 컨볼루션 연산이 수행된 출력 데이터(Dout)는 M×C×R의 크기를 갖는다. 이 때, N은 입력 데이터(Din)의 채널의 개수를 가리키고, W는 입력 데이터(Din)의 너비를 가리키고, H는 입력 데이터(Din)의 높이를 가리킨다. M은 출력 데이터(Dout)의 채널의 개수를 가리키고, C는 출력 데이터(Dout)의 너비를 가리키고, R은 출력 데이터(Dout)의 높이를 가리킨다.1 and 2 , the input data Din has a size of N×W×H, and the output data Dout on which a convolution operation is performed on the input data Din is M×C×R. have a size In this case, N indicates the number of channels of the input data Din, W indicates the width of the input data Din, and H indicates the height of the input data Din. M indicates the number of channels of the output data Dout, C indicates the width of the output data Dout, and R indicates the height of the output data Dout.

제1 계층(L1)의 곱셈-누산(MAC; Multiply Accumulate) 코어(L1_1)는 복수의 커널(KER_1~KER_M)을 기반으로, 입력 데이터(Din)에 대한 컨볼루션 연산을 수행하여 출력 데이터(Dout)를 생성할 수 있다. 예를 들어, 복수의 커널(KER_1~KER_M) 각각은 N×K×K의 크기를 가질 수 있다. MAC 코어(L1_1)는 K×K 사이즈의 커널과 입력 데이터(Din)의 중첩되는 데이터들 각각이 서로 곱할 수 있다(Multiplexing). MAC 코어(L1_1)는 입력 데이터(Din)의 각 채널 별로 곱해진 데이터의 값을 합산(Accumulation)하여 하나의 출력 데이터의 값(즉, 1×1×1의 데이터 값)으로 생성할 수 있다. MAC 코어(L1_1)는 이러한 연산 동작을 반복 수행하여, 복수의 커널(KER_1~KER_M) 각각에 대한 출력 데이터(Dout)를 생성할 수 있다. 이 때, 출력 데이터(Dout)의 채널의 개수는 복수의 커널(KER_1~KER_M)의 개수(즉, M개)와 동일할 것이다. The multiply-accumulate (MAC) core L1_1 of the first layer L1 performs a convolution operation on the input data Din based on the plurality of kernels KER_1 to KER_M to perform a convolution operation on the output data Dout ) can be created. For example, each of the plurality of kernels KER_1 to KER_M may have a size of N×K×K. The MAC core L1_1 may multiply each overlapping data of the K×K kernel and the input data Din (multiplexing). The MAC core L1_1 may generate one output data value (ie, a data value of 1×1×1) by accumulating the values of the data multiplied by each channel of the input data Din. The MAC core L1_1 may generate output data Dout for each of the plurality of kernels KER_1 to KER_M by repeatedly performing such an operation. In this case, the number of channels of the output data Dout may be the same as the number (ie, M) of the plurality of kernels KER_1 to KER_M.

For example, the MAC core L1_1 may perform the above-described convolution operation using an adder, a multiplier, a register, and the like. The multiplier of the MAC core L1_1 may multiply an input value of the input data by the corresponding weight value. The adder may add the result of the multiplication to the result of the previous operation stored in the register. The register may store the result of the addition. Thereafter, another input value is input to the MAC core L1_1 and the above operations are repeated, whereby the convolution operation is performed.
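The multiply, add, and store sequence described above can be sketched in C as follows. This is only an illustrative software model; the type `mac_core` and the functions `mac_step` and `mac_dot` are hypothetical names that do not appear in the specification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model of one MAC core: the multiplier and adder
   are the arithmetic in mac_step, and `acc` plays the role of the register. */
typedef struct {
    long acc; /* running sum held in the register */
} mac_core;

/* One step: multiply an input value by its weight, add the product to the
   previous result held in the register, and store the sum back. */
static void mac_step(mac_core *core, int input, int weight) {
    assert(core != NULL);
    core->acc += (long)input * weight;
}

/* Repeating the step over n (input, weight) pairs yields one output value. */
static long mac_dot(const int *inputs, const int *weights, size_t n) {
    mac_core core = { 0 };
    for (size_t i = 0; i < n; i++) {
        mac_step(&core, inputs[i], weights[i]);
    }
    return core.acc;
}
```

For instance, feeding the pairs (1,4), (2,5), (3,6) through `mac_dot` accumulates 4 + 10 + 18.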

However, the scope of the present invention is not limited thereto, and the above-described convolution operation may be implemented with a simple adder, a multiplier, a separate storage circuit, and the like instead of the MAC core L1_1. A bias BIAS, whose size equals the number of channels M, may be added to the output data Dout.

For example, the flow of the above-described convolution operation may be expressed as shown in Table 2. The algorithm configuration or program code in Table 2 illustrates the flow of the convolution operation by way of example, and the scope of the present invention is not limited thereto.

// Basic convolution computation
for (row = 0; row < R; row++) {
  for (col = 0; col < C; col++) {
    for (to = 0; to < M; to++) {
      for (ti = 0; ti < N; ti++) {
        for (i = 0; i < K; i++) {
          for (j = 0; j < K; j++) {
            output[to][row][col] +=
              weights[to][ti][i][j] *
              input[ti][S*row+i][S*col+j];
          }
        }
      }
    }
  }
}

Referring to Table 2, input is the input data Din and output is the output data Dout. R, C, M, N, and K are the variables described above that represent the sizes of the input data Din and the output data Dout, and S, as used in the input index, is the stride. The relationship between H, W and R, C may be expressed as H=R+K-1 and W=C+K-1 (these expressions correspond to a stride S of 1).
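A compilable version of the loop nest in Table 2 might look like the following sketch. It assumes flattened row-major arrays, integer data, and a caller that zero-initializes the output; the function names `input_extent` and `conv_basic` are illustrative and not part of the specification. The helper also generalizes the size relation to an arbitrary stride, which is an assumption derived from the index expression S*row+i.

```c
#include <assert.h>

/* Input extent needed along one axis for `out` outputs, kernel size K and
   stride S. With S == 1 this reduces to out + K - 1. */
static int input_extent(int out, int K, int S) {
    assert(out > 0 && K > 0 && S > 0);
    return S * (out - 1) + K;
}

/* Direct convolution over flattened row-major arrays:
   output[M][R][C], weights[M][N][K][K], input[N][H][W].
   The caller must zero-initialize `output`. */
static void conv_basic(int *output, const int *weights, const int *input,
                       int M, int N, int R, int C, int K, int S) {
    const int H = input_extent(R, K, S);
    const int W = input_extent(C, K, S);
    for (int row = 0; row < R; row++)
        for (int col = 0; col < C; col++)
            for (int to = 0; to < M; to++)
                for (int ti = 0; ti < N; ti++)
                    for (int i = 0; i < K; i++)
                        for (int j = 0; j < K; j++)
                            output[(to * R + row) * C + col] +=
                                weights[((to * N + ti) * K + i) * K + j] *
                                input[(ti * H + S * row + i) * W + S * col + j];
}
```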

According to the flow of the convolution operation described above, when the input/output data is very large, normal operation may be difficult because of the limited memory bandwidth available for the operation.

To implement the CNN 10 described above efficiently in hardware, various requirements must be considered. For example, implementing the CNN 10 in hardware requires minimizing the memory bandwidth needed to transfer data and parameters. To recognize an object, real-time image data input from a camera or image data stored in an external memory is input to the hardware circuit constituting the CNN 10. As a specific example, supporting about 30 frames per second of real-time video requires a very large memory bandwidth. To support pixel data with a size of 640×480 in each of three channels (red, green, and blue), about 28 MByte of data per second must be input continuously to the input layer. In addition, separately from the input data, parameter data used for various operations such as the convolution operation must be input to the hardware circuit. As an example, AlexNet requires about 61,000,000 parameters each time it recognizes a single image. Assuming that each parameter has a bit width of 16 bits, parameters amounting to about 128 MByte are required. Furthermore, because the hardware circuit is structured to operate on data and parameters simultaneously, it must frequently exchange output data and parameters with the external memory.
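The figures quoted above can be checked with a short calculation. The sketch below assumes one byte per color sample, which the text does not state, and uses hypothetical helper names. Under that assumption the video stream needs 27,648,000 bytes per second (roughly the 28 MByte quoted), and 61,000,000 parameters of 16 bits occupy 122,000,000 bytes, which is of the same order as the 128 MByte quoted.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second for an uncompressed video stream. */
static uint64_t stream_bytes_per_sec(uint32_t width, uint32_t height,
                                     uint32_t channels,
                                     uint32_t bytes_per_sample,
                                     uint32_t fps) {
    assert(width > 0 && height > 0 && channels > 0);
    return (uint64_t)width * height * channels * bytes_per_sample * fps;
}

/* Storage in bytes for `count` parameters of `bits` bits each.
   Only whole-byte widths are handled in this sketch. */
static uint64_t param_bytes(uint64_t count, uint32_t bits) {
    assert(bits % 8 == 0);
    return count * (bits / 8);
}
```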

In addition, the arithmetic processing performance must be improved by implementing the convolution operator included in the hardware circuit of the CNN 10 effectively. In general, a convolution operation is performed using processing elements arranged in an array structure. In such an array-structured operator, it is important to control the parameters, which consist of weights and biases, and to control the buffering of input/output data. In addition, to improve the throughput per unit time, buffering of the parameters input to the array-structured convolver is important.

In view of the above requirements, computation hardware that handles a large amount of computation can be designed efficiently by appropriately partitioning the input data, the output data, or the parameters. For example, the CNN 10 may divide the input data into uniform parts, and read and process the input data in units of the divided parts. The MAC core L1_1 may then repeat the operation as many times as the number of parts and store the operation results in the external memory. That is, because the hardware resources of the CNN 10 are limited, the limitation can be overcome by repeatedly applying a partial convolution operation that divides the input data and operates on each part.

FIG. 3 is a block diagram illustrating a hardware configuration for implementing a CNN system that performs a partial convolution operation. FIG. 3 shows the essential components for implementing a neural network system according to an embodiment of the present invention in hardware such as an FPGA or a GPU. The functional blocks described in the following drawings and detailed description may be implemented as hardware, software, or a combination thereof.

Referring to FIG. 3, the CNN system 100 may include an input buffer device 110, a multiply-accumulate (MAC) operator 120, an output buffer device 130, and a weight kernel buffer device 140. The CNN system 100 may be connected to an external memory 101 and configured to exchange a portion of the input data Din_T, weight kernels, and a portion of the output data Dout_T with it.

For example, the input buffer device 110 may load a portion of the input data Din_T from the external memory 101. To perform the partial operation described above, the input data may be divided into uniform parts, and the input buffer device 110 may load one of the divided parts Din_T from the external memory 101. For brevity, the portion of the input data Din_T loaded into the input buffer device 110 is referred to as an input tile.

For example, the size of the input buffer device 110 may vary according to the size of the kernel used for the convolution operation. For example, when the kernel size is K×K, enough input data must be loaded into the input buffer device 110 for the MAC operator 120 to perform the convolution operation with the kernel sequentially. That is, the size of the input buffer device 110 or the size of the input tile Din_T may be determined based on the size of the kernel.

The MAC operator 120 may perform the convolution operation using the input buffer device 110, the weight kernel buffer device 140, and the output buffer device 130. For example, the MAC operator 120 may include a plurality of MAC cores 121 to 12i. As described with reference to FIG. 2, each of the plurality of MAC cores 121 to 12i may perform a convolution operation on the input tile Din_T using a plurality of kernels. In this case, the convolution operations may be processed in parallel. The number of MAC cores 121 to 12i may be determined according to the size of the kernel or the size of the input tile Din_T. For example, each of the plurality of MAC cores 121 to 12i may operate similarly to, or have a structure similar to, the MAC core L1_1 described with reference to FIG. 2.

The output buffer device 130 may load a portion of the output data Dout_T produced by the convolution operation or pooling operation executed by the MAC operator 120. The portion of the output data Dout_T loaded into the output buffer device 130 may be updated according to the result of each convolution loop executed with the plurality of kernels. Alternatively, the portion of the output data Dout_T loaded into the output buffer device 130 may be provided to the external memory 101, and a plurality of such portions Dout_T may be combined to form the output data Dout. Hereinafter, for brevity, the portion of the output data Dout_T loaded into the output buffer device 130 is referred to as an output tile.

The weight kernel buffer device 140 may load, from the external memory 101, the parameters needed for the convolution operation, bias addition, activation (ReLU), pooling, and the like performed by the MAC operator 120, and may provide the loaded parameters to the MAC operator 120. In addition, parameters learned in the learning stage may be stored in the weight kernel buffer device 140. The learned parameters stored in the weight kernel buffer device 140 may be provided to the external memory 101 and updated.

FIG. 4 is a diagram for explaining the convolution operation of the CNN system of FIG. 3. For brevity, FIG. 4 shows a configuration in which one MAC core 121 performs the convolution operation, and components that are not needed to explain the convolution operation of the CNN system 100 are omitted.

Referring to FIGS. 3 and 4, the input buffer device 110 may load an input tile Din_T, which is a portion of the input data Din. The input tile Din_T may have a size of Tn×Tw×Th. Tn denotes the number of channels of the input tile Din_T, Tw denotes its width, and Th denotes its height. Tn, Tw, and Th may be determined according to the computing capability of the MAC operator 120, the size of the input buffer device 110, the size of the kernels, or the number of kernels.

The MAC core 121 may perform a convolution operation on the input tile Din_T loaded into the input buffer device 110, using the plurality of kernels KER_1 to KER_M from the weight kernel buffer device 140. For example, the MAC core 121 may perform the convolution operation as described with reference to FIG. 2. The MAC core 121 may generate an output tile Dout_T by performing the convolution operation.

The generated output tile Dout_T may be loaded into the output buffer device 130. For example, the output tile Dout_T may have a size of Tm×Tc×Tr. Tm denotes the number of channels of the output tile Dout_T, Tc denotes its width, and Tr denotes its height. Tm, Tc, and Tr may be determined according to the size of the input tile Din_T and the sizes of the kernels. For example, the output tile Dout_T stored in the output buffer device 130 may be provided to the external memory 101.

For example, the above-described convolution operation may be repeated for the other input tiles of the input data Din, and the output data Dout may be generated by combining the results.

As described above, for the partial operation, the input data Din may be divided into parts of a fixed size (i.e., fixed tile units), and the above-described convolution operation may be performed for each input tile. Accordingly, the operation on the input data can be performed efficiently without being bound by hardware limitations such as memory bandwidth and memory capacity.

For example, the flow of the above-described partial convolution operation may be expressed as shown in Table 3. The algorithm configuration or program code in Table 3 illustrates the flow of the partial convolution operation by way of example, and the scope of the present invention is not limited thereto.

// Partial (tiled) convolution computation
for (row = 0; row < R; row += Tr) {
  for (col = 0; col < C; col += Tc) {
    for (to = 0; to < M; to += Tm) {
      for (ti = 0; ti < N; ti += Tn) {
        // load tiled input
        // load tiled weights
        // load tiled output
        // on-chip data computation
        for (trr = row; trr < min(row+Tr, R); trr++) {
          for (tcc = col; tcc < min(col+Tc, C); tcc++) {
            for (too = to; too < min(to+Tm, M); too++) {
              for (tii = ti; tii < min(ti+Tn, N); tii++) {
                for (i = 0; i < K; i++) {
                  for (j = 0; j < K; j++) {
                    output[too][trr][tcc] +=
                      weights[too][tii][i][j] *
                      input[tii][S*trr+i][S*tcc+j];
                  }
                }
              }
            }
          }
        }
        // store tiled output
      }
    }
  }
}

For example, Th, the height of the input tile Din_T, may be expressed as {Tr+K-1}, and Tw, the width of the input tile Din_T, may be expressed as {Tc+K-1}. Th and Tw do not appear in the algorithm of Table 3, but in an actual hardware implementation they may be expressed as the size of the input buffer device 110.
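As a rough illustration of how these relations translate into buffer sizing, the following sketch computes element counts for an input tile and an output tile. The function names are hypothetical, and the formulas assume a stride of 1 (as in Th = Tr+K-1) and count elements rather than bytes.

```c
#include <assert.h>

/* Input tile extent along one axis for an output tile extent and kernel
   size K, i.e. Th = Tr + K - 1 or Tw = Tc + K - 1 (stride 1 assumed). */
static int input_tile_extent(int out_tile, int K) {
    assert(out_tile > 0 && K > 0);
    return out_tile + K - 1;
}

/* Number of elements in one Tn x Tw x Th input tile. */
static long input_buffer_elems(int Tn, int Tr, int Tc, int K) {
    return (long)Tn * input_tile_extent(Tr, K) * input_tile_extent(Tc, K);
}

/* Number of elements in one Tm x Tc x Tr output tile. */
static long output_buffer_elems(int Tm, int Tr, int Tc) {
    assert(Tm > 0 && Tr > 0 && Tc > 0);
    return (long)Tm * Tr * Tc;
}
```

With Tr = Tc = 4 and K = 3, for example, each input tile plane is 6×6, so a two-channel input tile holds 72 elements.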

Referring to Table 3, the partial convolution loop expressed by the variables Tr, Tc, Tm, Tn, and K operates as a hardware engine, and this hardware engine may be executed repeatedly as many times as the total number of partitions of the input data Din (i.e., the number of input tiles Din_T).
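The tiled loop structure of Table 3 can be made compilable as in the sketch below. It is a software model only: it indexes the full flattened arrays directly instead of copying tiles into on-chip buffers, uses integer data, and expects a zero-initialized output. The names `min_int` and `conv_tiled` are illustrative assumptions.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Tiled convolution over flattened row-major arrays:
   output[M][R][C], weights[M][N][K][K], input[N][H][W].
   The caller must zero-initialize `output`. */
static void conv_tiled(int *output, const int *weights, const int *input,
                       int M, int N, int R, int C, int K, int S,
                       int Tm, int Tn, int Tr, int Tc) {
    assert(Tm > 0 && Tn > 0 && Tr > 0 && Tc > 0 && S > 0);
    const int H = S * (R - 1) + K;
    const int W = S * (C - 1) + K;
    for (int row = 0; row < R; row += Tr)
    for (int col = 0; col < C; col += Tc)
    for (int to = 0; to < M; to += Tm)
    for (int ti = 0; ti < N; ti += Tn) {
        /* Hardware would load the input, weight, and output tiles here. */
        for (int trr = row; trr < min_int(row + Tr, R); trr++)
        for (int tcc = col; tcc < min_int(col + Tc, C); tcc++)
        for (int too = to; too < min_int(to + Tm, M); too++)
        for (int tii = ti; tii < min_int(ti + Tn, N); tii++)
        for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
            output[(too * R + trr) * C + tcc] +=
                weights[((too * N + tii) * K + i) * K + j] *
                input[(tii * H + S * trr + i) * W + S * tcc + j];
        /* Hardware would store the output tile here. */
    }
}
```

Because each output element accumulates across the ti tiles, the result does not depend on the chosen tile sizes.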

The CNN model configured as described above may be implemented in hardware such as an FPGA or a GPU. In this case, the size of the input buffer device 110, the size of the output buffer device 130, the size of the weight kernel buffer device 140, the number of parallel MAC cores, and the number of memory accesses must be determined in consideration of the resources, operating time, power consumption, and the like of the hardware platform.

In general neural network design, the design parameters are determined under the assumption that the kernel weights are filled entirely with non-zero values. That is, a roof-top (roofline) model is used to determine the design parameters of a general neural network. However, when a neural network model is implemented on mobile hardware or on a resource-limited FPGA, a method or apparatus for reducing the size of the neural network is needed because of hardware limitations. In neural network operations that require many parameters, a method of reducing the number or size of the parameters needed by the neural network in order to minimize the overall computation is called neural network compression (deep compression).

Through the neural network compression described above, the weight kernels used in the convolution operation may be compressed into the form of sparse weights. A sparse weight is an element of a compressed neural network and is configured to represent a compressed connection or a compressed kernel rather than the connections of all neurons. For example, in a two-dimensional K×K weight kernel, some of the weight values are compressed to have a value of '0'. A weight that is not '0' is called a sparse weight.

Using a kernel with sparse weights (i.e., a sparse weight kernel) can reduce the amount of computation in the CNN. That is, the overall amount of computation may be reduced according to the sparsity of the weight kernel filter. For example, when '0' values make up 90% of all weights in a two-dimensional K×K weight kernel, the sparsity is said to be 90%. Accordingly, when a weight kernel with 90% sparsity is used, the actual amount of computation is reduced to 10% of that required with a general weight kernel (i.e., a non-sparse weight kernel).
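The relation between sparsity and computation can be illustrated with a small calculation. The sketch below uses hypothetical helper names and assumes, as the text does, that one multiply-accumulate is needed per retained weight per output position.

```c
#include <assert.h>

/* Count the weights that are not zero. */
static int count_nonzero(const int *w, int n) {
    int nnz = 0;
    for (int p = 0; p < n; p++) {
        if (w[p] != 0) nnz++;
    }
    return nnz;
}

/* Sparsity as the integer percentage of weights that are zero. */
static int sparsity_percent(int nnz, int total) {
    assert(total > 0 && nnz >= 0 && nnz <= total);
    return 100 * (total - nnz) / total;
}

/* MAC operations for `weights_used` weights applied at each of the
   R x C output positions. */
static long mac_ops(long weights_used, long R, long C) {
    return weights_used * R * C;
}
```

A kernel with 10 weights of which only 1 is non-zero has 90% sparsity and needs one tenth of the MAC operations of the dense kernel.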

FIG. 5 is a diagram for explaining the sparse weight kernel of the present invention. For brevity, it is assumed that K = 3 and that the number of channels of the weight kernel is 1. That is, the weight kernel has a size of 1×3×3.

Referring to FIG. 5, a full weight kernel KW in a general neural network model may be converted into a sparse weight kernel SW through neural network compression.

When K is 3, the full weight kernel KW may be expressed as a matrix having nine weight values K₀ to K₈. The neural network compression operation may include various operations such as parameter removal, weight sharing, and quantization. Parameter removal is a technique that omits some neurons from the input data or a hidden layer. Weight sharing is a technique that shares parameters by mapping identical or similar parameters in each layer of the neural network to a single representative value. Quantization is a method of quantizing the data size of the weights, the input/output layers, and the hidden layers. However, the neural network compression operation is not limited to the above techniques and may include various other compression techniques.

The full weight kernel KW is converted through neural network compression into a sparse weight kernel SW in which some weight values are '0'. For example, by neural network compression, the weight values K₀ to K₈ of the full weight kernel KW may be converted into the weight values W₀ to W₈ of the sparse weight kernel SW, respectively. In this case, depending on the algorithm used, some weight values (W₁, W₂, W₃, W₄, W₆, W₇, W₈) of the sparse weight kernel SW may have a value of '0'. That is, some of the weight values W₀ to W₈ of the sparse weight kernel SW may be '0', and the rest may have non-zero values. The non-zero values are referred to as sparse weights.

The characteristics of a kernel in a compressed neural network may be determined by the positions and values of the sparse weights (e.g., W₀ and W₅). In practice, when the MAC cores 121 to 12i (see FIG. 3) perform the convolution operation between an input tile and a weight kernel, the multiplication or addition operations corresponding to '0' values in the weight kernel may be omitted. Accordingly, only the multiplication and addition operations for the sparse weights W₀ and W₅ are performed, and the amount of computation in a convolution operation that uses only the sparse weights of the sparse weight kernel SW is greatly reduced. Because only the sparse weights, rather than the full weights, are exchanged with the external memory 201, the number of memory accesses or the memory bandwidth will also decrease.
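One simple way to represent a sparse weight kernel by the positions and values of its non-zero weights is sketched below. The storage layout (a value array plus an array of flat positions) and the name `compress_kernel` are assumptions for illustration; the specification does not prescribe a particular format here.

```c
#include <assert.h>

/* Scan a dense K x K kernel in row-major order and keep only the non-zero
   weights. values[s] receives the s-th non-zero weight and sparse_idx[s]
   its flat position (0 .. K*K-1). Returns the number of non-zero weights.
   Both output arrays must have room for K*K entries. */
static int compress_kernel(const int *dense, int K,
                           int *values, int *sparse_idx) {
    assert(K > 0);
    int nnz = 0;
    for (int p = 0; p < K * K; p++) {
        if (dense[p] != 0) {
            values[nnz] = dense[p];
            sparse_idx[nnz] = p;
            nnz++;
        }
    }
    return nnz;
}
```

For a 3×3 kernel in which only W₀ and W₅ are non-zero, this yields two values with positions 0 and 5.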

For example, when the partial convolution operation is performed using a sparse weight kernel, the algorithm of Table 3 may be transformed as shown in Table 4.

// on-chip data computation
for (too = to; too < min(to+Tm, M); too++) {
  for (tii = ti; tii < min(ti+Tn, N); tii++) {
    for (s = 0; s < NNZ(too, tii); s++) {
      i = sparse_idx(too, tii, s) / K;
      j = sparse_idx(too, tii, s) % K;
      for (trr = row; trr < min(row+Tr, R); trr++) {
        for (tcc = col; tcc < min(col+Tc, C); tcc++) {
          output[too][trr][tcc] +=
            weights[too][tii][s] *
            input[tii][S*trr+i][S*tcc+j];
        }
      }
    }
  }
}

Referring to Table 4, compared with the algorithm of Table 3, the loop performed per kernel unit (K×K) is replaced by a loop over the NNZ (number of non-zero) weights, that is, the weights in the sparse weight matrix that are not '0'. Because no operation is performed for weight values of '0' in the weight kernel, the overall amount of computation may be reduced. In addition, because the MACs required for the operation can be implemented as R×C, a general hardware configuration may be used.
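The sparse loop of Table 4 can be sketched as compilable C as follows. For simplicity the sketch treats the whole output as a single tile and assumes a particular storage layout: each (output channel, input channel) pair owns a slot of K*K entries in `values` and `sparse_idx`, of which the first `nnz` entries are used. This layout and the name `conv_sparse` are illustrative assumptions, not the specification's format.

```c
#include <assert.h>

/* Sparse convolution over flattened row-major arrays:
   output[M][R][C], input[N][H][W], nnz[M][N],
   values[M][N][K*K] and sparse_idx[M][N][K*K] (first nnz entries valid).
   The caller must zero-initialize `output`. */
static void conv_sparse(int *output, const int *values,
                        const int *sparse_idx, const int *nnz,
                        const int *input,
                        int M, int N, int R, int C, int K, int S) {
    assert(K > 0 && S > 0);
    const int H = S * (R - 1) + K;
    const int W = S * (C - 1) + K;
    for (int too = 0; too < M; too++)
    for (int tii = 0; tii < N; tii++) {
        const int slot = (too * N + tii) * K * K;
        for (int s = 0; s < nnz[too * N + tii]; s++) {
            /* Recover the kernel row and column from the flat position. */
            const int i = sparse_idx[slot + s] / K;
            const int j = sparse_idx[slot + s] % K;
            for (int trr = 0; trr < R; trr++)
            for (int tcc = 0; tcc < C; tcc++)
                output[(too * R + trr) * C + tcc] +=
                    values[slot + s] *
                    input[(tii * H + S * trr + i) * W + S * tcc + j];
        }
    }
}
```

Only the stored non-zero weights are visited, so the zero entries of the kernel contribute no multiplications or additions.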

FIG. 6 is a block diagram showing the hardware configuration of a CNN system according to an embodiment of the present invention. Hereinafter, for brevity, the weight kernel used in the MAC operator 220 is assumed to be the sparse weight kernel SW described above. In addition, so as not to obscure the embodiments of the present invention, descriptions of parameters other than the weight kernel (e.g., bias) are omitted.

Referring to FIG. 6, the CNN system 200 may include an input buffer device 210, a MAC operator 220, an output buffer device 230, a weight kernel buffer device 240, and a data selector 250. The MAC operator 220 may include a plurality of MAC cores 221 to 22i. For example, each of the MAC cores 221 to 22i may operate similarly to, or have a structure similar to, the MAC core L1_1 described with reference to FIG. 2. The CNN system 200 may be configured to exchange input tiles Din_T and output tiles Dout_T with the external memory 201.

The input buffer device 210, the MAC operator 220, the output buffer device 230, the weight kernel buffer device 240, the plurality of MAC cores 221 to 22i, and the external memory 201 have been described with reference to FIGS. 3 and 4, and thus a detailed description thereof is omitted.

Compared with the CNN system 100 of FIG. 3, the CNN system 200 may further include a data selector 250. The data selector 250 may be configured to provide only some of the data values of the input tile Din_T loaded into the input buffer device 210 to the MAC operator 220.

For example, the weight kernel buffer device 240 may hold the sparse weight kernel SW. The data selector 250 may receive the sparse index SPI of the sparse weight kernel SW from the weight kernel buffer device 240 and, based on the received sparse index SPI, provide only some of the data values of the input tile Din_T to the MAC operator 220. The sparse index SPI indicates the positions of the weights with non-zero values in the sparse weight kernel SW. For example, the sparse index SPI for the sparse weight kernel SW shown in FIG. 5 indicates the positions of W₀ and W₅, either in {row, column} form as {0,0} and {1,2}, or in simple position form (i.e., index numbers) as (0, 5).

In more detail, as described above, when the weight kernel is a sparse weight kernel SW composed of a sparse matrix, the multiplication or addition operations for weights whose value is '0' may be omitted. That is, the data selector 250 provides the MAC operator 220 with only the data values that correspond to non-zero weights, based on the sparse index SPI, and the MAC operator 220 performs the addition or multiplication operations on the provided data values. Accordingly, the operations corresponding to '0' weights can be omitted.

The hardware configuration of the data selector 250 is described in more detail with reference to FIGS. 7 to 9. However, the configuration of the data selector 250 is not limited to the hardware configurations described below and may be modified in various forms.

FIG. 7 is a block diagram showing the CNN system of FIG. 6 in more detail. For brevity, FIG. 7 shows the configuration of the CNN system 200 for one input tile Din_T. However, the scope of the present invention is not limited thereto; the CNN system 200 may further include components for each of the other input tiles, or may repeat the operation for each input tile using the components shown in FIG. 7.

Referring to FIGS. 6 and 7, the CNN system 200 may include the input buffer device 210, the MAC operator 220, the output buffer device 230, the weight kernel buffer device 240, and the data selector 250. Since these components have been described with reference to FIG. 6, a detailed description of them is omitted.

The input buffer device 210 may include a plurality of input buffers, each configured to load a data value of the input tile Din_T. For example, the input tile Din_T may have a size of Tn×Tw×Th and may be divided, channel by channel, into sub-input tiles of size Tw×Th. Each data value of a sub-input tile may be loaded into the input buffers. Depending on the number of channels of the weight kernel, the data values of the plurality of sub-input tiles may be loaded into the input buffers in parallel.

The data selector 250 may include a switch circuit 25A and a plurality of multiplexers (MUXs) 251 to 25i. Based on the sparse weight kernel SW, the switch circuit 25A may provide the data values stored in the plurality of input buffers to each of the plurality of MUXs 251 to 25i.

For example, assume that Tw=3, Th=3, and Tn=1, that K=2 for the sparse weight kernel SW, and that the stride is 1. In this case, the input tile Din_T may be expressed as a matrix of zeroth to eighth input values I₀ to I₈, which may be stored in the zeroth to eighth input buffers, respectively. The switch circuit 25A may connect the zeroth, first, third, and fourth input buffers to the first MUX 251 so that the zeroth, first, third, and fourth input values I₀, I₁, I₃, and I₄ are provided to the first MUX 251. The switch circuit 25A may also connect the first, second, fourth, and fifth input buffers to the second MUX 252 so that the first, second, fourth, and fifth input values I₁, I₂, I₄, and I₅ are provided to the second MUX 252. Likewise, the switch circuit 25A may connect the third, fourth, sixth, and seventh input buffers to the third MUX 253 so that the third, fourth, sixth, and seventh input values I₃, I₄, I₆, and I₇ are provided to the third MUX 253. In this manner, the switch circuit 25A may connect the plurality of input buffers and the plurality of MUXs 251 to 25i to each other based on the sparse weight kernel SW.
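The buffer-to-MUX connections in this example follow a sliding-window pattern, which the Python sketch below reproduces for a 3×3 single-channel tile with K=2 and a stride of 1. The function name `window_buffers` and the row-major buffer numbering are assumptions made for illustration, not the circuit's actual implementation. The fourth window it prints is not listed in the text above and is simply what the same pattern yields.

```python
def window_buffers(tw, th, k, stride=1):
    """For each kernel position, list the flat input-buffer indices that the window covers."""
    windows = []
    for top in range(0, th - k + 1, stride):
        for left in range(0, tw - k + 1, stride):
            windows.append([(top + dr) * tw + (left + dc)
                            for dr in range(k)
                            for dc in range(k)])
    return windows


# 3x3 tile, 2x2 kernel, stride 1 (the conditions assumed in the example above).
connections = window_buffers(tw=3, th=3, k=2)
for mux_id, buffers in enumerate(connections, start=1):
    print(f"MUX {mux_id}: input buffers {buffers}")
# MUX 1: input buffers [0, 1, 3, 4]
# MUX 2: input buffers [1, 2, 4, 5]
# MUX 3: input buffers [3, 4, 6, 7]
# MUX 4: input buffers [4, 5, 7, 8]
```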

Based on the sparse index SPI from the weight kernel buffer device 240, each of the plurality of MUXs 251 to 25i may select one of the data values from the input buffers connected to it and provide the selected value to the MAC cores 221 to 22i of the MAC operator 220. For example, each of the MUXs 251 to 25i may select the data value that corresponds to a non-zero weight position, based on the sparse index SPI, and pass the selected data value to its corresponding MAC core. In more detail, assume that Tw=3, Th=3, and Tn=1, that K=2 for the sparse weight kernel SW, that the stride is 1, and that the sparse index SPI (that is, the position of the non-zero weight) is [0,0]. In this case, as described above, the zeroth, first, third, and fourth data values I₀, I₁, I₃, and I₄ are provided to the first MUX 251. Because the sparse index SPI is [0,0], the convolution operations on the data values other than the one corresponding to position [0,0] may be omitted. In other words, only the convolution operation on the zeroth data value I₀, which corresponds to the position indicated by the sparse index SPI (that is, [0,0]), needs to be performed, so the MUX 251 may select the zeroth data value I₀ and provide it to the MAC core 221. The other MUXs 252 to 25i may perform similar operations.
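The selection step can be sketched in a few lines of Python. Here a MUX is modeled as a function that picks, from the values wired to it, those at the kernel positions named by the sparse index. The name `mux_select` and the string labels standing in for data values are hypothetical.

```python
def mux_select(connected_values, spi_positions, k):
    """Return, in order, the connected values located at the non-zero weight positions.

    connected_values: the K*K values wired to this MUX, in row-major window order.
    spi_positions:    (row, col) positions of the non-zero weights in the KxK kernel.
    """
    return [connected_values[r * k + c] for r, c in spi_positions]


# Values wired to the first MUX in the example above (labels stand in for data).
first_mux_inputs = ["I0", "I1", "I3", "I4"]
selected = mux_select(first_mux_inputs, spi_positions=[(0, 0)], k=2)
print(selected)  # ['I0']
```

Only the selected value reaches the MAC core. The other three values are never multiplied.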

Each of the plurality of MAC cores 221 to 22i of the MAC operator 220 may perform multiplication and addition operations (that is, a convolution operation) based on the received data value and the sparse weight kernel SW, and may output the resulting output data.

The output buffer device 230 includes a plurality of output buffers, each of which may store or accumulate the output data from a respective one of the plurality of MAC cores 221 to 22i. For example, as described above, the MAC operator 220 may perform a convolution operation on the input tile Din_T using a first sparse weight kernel. The MAC operator 220 may then perform a convolution operation on the input tile Din_T using a second sparse weight kernel different from the first. The result of the convolution operation using the first sparse weight kernel may be the first channel of the output tile Dout_T, and the result of the convolution operation using the second sparse weight kernel may be the second channel of the output tile Dout_T. That is, the output buffer device 230 may store or accumulate the results of the convolution operations performed with a plurality of sparse weight kernels as different channels of the output tile Dout_T. In other words, when convolution operations are performed on one input tile Din_T using M sparse weight kernels, the output tile Dout_T may have M channels.
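The relationship between M kernels and M output channels can be illustrated with the sketch below. It is a plain software model rather than the hardware described here, and it assumes a single-channel tile, square kernels, and no padding. The names `sparse_conv2d` and `output_tile` and all numeric values are made up for the example.

```python
def sparse_conv2d(tile, kernel, stride=1):
    """Unpadded 2-D convolution (CNN style, no kernel flip) that multiplies only non-zero weights."""
    k = len(kernel)
    th, tw = len(tile), len(tile[0])
    nonzero = [(r, c, kernel[r][c])
               for r in range(k) for c in range(k)
               if kernel[r][c] != 0]
    out = []
    for top in range(0, th - k + 1, stride):
        row = []
        for left in range(0, tw - k + 1, stride):
            row.append(sum(w * tile[top + r][left + c] for r, c, w in nonzero))
        out.append(row)
    return out


def output_tile(tile, kernels):
    """One output channel per sparse weight kernel."""
    return [sparse_conv2d(tile, kernel) for kernel in kernels]


tile = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
kernels = [[[1, 0], [0, 0]],   # first sparse kernel: only the top-left weight is non-zero
           [[0, 0], [0, 2]]]   # second sparse kernel: only the bottom-right weight is non-zero
dout_t = output_tile(tile, kernels)
print(len(dout_t))  # 2 channels, one per kernel
print(dout_t[0])    # [[1, 2], [4, 5]]
print(dout_t[1])    # [[10, 12], [16, 18]]
```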

As described above, the data selector 250 according to the present invention provides the MAC operator 220 with only the data values that correspond to the positions of non-zero weight values, based on the sparse index SPI of the sparse weight kernel SW, so the convolution operations on the data values that correspond to the positions of '0' weight values can be omitted. Accordingly, the computational efficiency of the CNN system 200 is improved.

FIGS. 8 and 9 are diagrams for describing the operation of the CNN system of FIG. 7 in more detail. Components that are not needed to clearly describe the operation of the hardware-implemented CNN system 200 according to an embodiment of the present invention are omitted.

In the following, specific data conditions are assumed for brevity of the drawings and convenience of description. Referring to FIGS. 7 to 9, it is assumed that the input tile Din_T has one channel (Tn=1), a width Tw of 4, and a height Th of 4. That is, the input tile Din_T has a size of 1×4×4 and, as shown in FIG. 8, includes zeroth to fifteenth input values I₀ to I₁₅, which may be expressed in matrix form as shown in FIG. 8.

It is also assumed that the value K, which indicates the width and height of the sparse weight kernel SW, is 3 and that the stride is 1. That is, the sparse weight kernel SW has a size of 1×3×3 and includes zeroth to eighth weight values W₀ to W₈, which may be expressed in matrix form as shown in FIG. 8. It is further assumed that, among the zeroth to eighth weight values W₀ to W₈, the first, second, third, fourth, sixth, seventh, and eighth weight values W₁, W₂, W₃, W₄, W₆, W₇, and W₈ are '0', and that the zeroth and fifth weight values W₀ and W₅ are not '0'. That is, the sparse index SPI of the sparse weight kernel SW corresponds to the positions of the zeroth and fifth weight values W₀ and W₅.

Further, the output data Dout_T, which is the result of the convolution operation performed on the input tile Din_T and the sparse weight kernel SW described above, has one channel (Tm=1), a width Tc of 2, and a height Tr of 2.
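The 2×2 output size is consistent with the usual size rule for a convolution without padding, which the short Python check below applies to the assumed conditions. The absence of padding is an assumption here, since the description does not state it explicitly, and `output_size` is a hypothetical helper name.

```python
def output_size(t, k, stride=1):
    """Output extent along one dimension for an unpadded convolution."""
    return (t - k) // stride + 1


Tc = output_size(t=4, k=3, stride=1)  # width:  (4 - 3) / 1 + 1 = 2
Tr = output_size(t=4, k=3, stride=1)  # height: (4 - 3) / 1 + 1 = 2
print(Tc, Tr)  # 2 2
```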

The conditions above are intended to clearly describe the operation of the components of the present invention, and the scope of the present invention is not limited thereto. The sizes and values of the input data, the input tile, the weight kernel, and other parameters may be variously modified, and the number or structure of the hardware components included in the CNN system 200 may be modified accordingly.

For the input tile Din_T and the sparse weight kernel SW shown in FIG. 8, the CNN system 200 may perform zeroth to third convolution operations CON0 to CON3.

For example, as shown in FIG. 9, the zeroth to fifteenth input values I₀ to I₁₅ of the input tile Din_T may be loaded into the zeroth to fifteenth input buffers 210_00 to 210_15, respectively. The switch circuit 25A may connect the zeroth to fifteenth input buffers 210_00 to 210_15 to the MUXs 251 to 254 based on the sparse weight kernel SW. Each of the MUXs 251 to 254 may select, based on the sparse index SPI, one of the input values from the input buffers connected to it and provide the selected value to the respective one of the MAC cores 221 to 224. Each of the MAC cores 221 to 224 may perform a convolution operation using the received input value and the sparse weight kernel SW.

The switch circuit 25A may connect the plurality of input buffers and the plurality of MUXs to each other based on the sparse weight kernel SW and the size of the input tile Din_T (that is, Tn×Tw×Th). Because the input tile Din_T is assumed to have a specific size in order to describe the embodiment clearly, this configuration is not separately illustrated in FIGS. 6 and 7. However, the scope of the present invention is not limited thereto, and the configuration of the switch circuit 25A or the connections it makes may be variously modified based on the sparse weight kernel SW and the size of the input tile Din_T (that is, Tn×Tw×Th).

The operation of the data selector 250 and the convolution operations are described in more detail below.

The zeroth convolution operation CON0 may be performed by the MAC core 221. For example, the zeroth convolution operation CON0 is performed based on the zeroth, first, second, fourth, fifth, sixth, eighth, ninth, and tenth input values I₀, I₁, I₂, I₄, I₅, I₆, I₈, I₉, and I₁₀ of the input tile Din_T and the sparse weight kernel SW, and a zeroth output value R₀ may be generated as its result.

For example, as described above, the switch circuit 25A may connect the input buffers 210_00, 210_01, 210_02, 210_04, 210_05, 210_06, 210_08, 210_09, and 210_10 to the MUX 251 (see the solid lines in the switch circuit 25A of FIG. 9) so that the zeroth, first, second, fourth, fifth, sixth, eighth, ninth, and tenth input values I₀, I₁, I₂, I₄, I₅, I₆, I₈, I₉, and I₁₀ are provided to the MUX 251. The MUX 251 may select, based on the sparse index SPI, one of the input values from the connected input buffers and provide it to the MAC core 221.

As described with reference to FIG. 8, the sparse index SPI may correspond to the positions of the zeroth and fifth weight values W₀ and W₅. In this case, in the zeroth convolution operation CON0, the zeroth input value I₀ corresponds to the position of the zeroth weight value W₀, and the sixth input value I₆ corresponds to the position of the fifth weight value W₅. The MUX 251 first outputs the zeroth input value I₀, which corresponds to the position of W₀. The MAC core 221 multiplies the received zeroth input value I₀ by the zeroth weight value W₀ of the sparse weight kernel SW and stores the result in an internal register. The MUX 251 then outputs, based on the sparse index SPI, the sixth input value I₆, which corresponds to the position of W₅. The MAC core 221 multiplies the sixth input value I₆ by the fifth weight value W₅ of the sparse weight kernel SW and adds the product to the value stored in the register (that is, the product of I₀ and W₀). The result may be stored in the internal register. Since the operations on all input values corresponding to the positions in the sparse index SPI have then been performed, the zeroth convolution operation CON0 ends, and the result is provided to the output buffer 230_00 as the zeroth output value R₀.
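The multiply-then-accumulate sequence of CON0 can be traced with the small Python model below. The input values (Iₙ = n) and the weights (W₀ = 2, W₅ = 3) are invented for the example, and `mac_core` is a hypothetical name for a software stand-in for one MAC core.

```python
def mac_core(selected_inputs, nonzero_weights):
    """Multiply each selected input by its weight and accumulate in a register."""
    register = 0  # models the internal register of the MAC core
    for x, w in zip(selected_inputs, nonzero_weights):
        register += x * w
    return register


# Hypothetical 4x4 tile flattened row-major, with I_n = n.
tile = list(range(16))
W0, W5 = 2, 3  # made-up non-zero weights

# CON0: the MUX outputs I0 (position of W0) and then I6 (position of W5).
R0 = mac_core([tile[0], tile[6]], [W0, W5])
print(R0)  # 0*2 + 6*3 = 18
```

Two multiplications produce R₀ here, whereas a dense 3×3 kernel would need nine.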

The first convolution operation CON1 may be performed by the MAC core 222. For example, the first convolution operation CON1 is performed based on the first, second, third, fifth, sixth, seventh, ninth, tenth, and eleventh input values I₁, I₂, I₃, I₅, I₆, I₇, I₉, I₁₀, and I₁₁ of the input tile Din_T and the sparse weight kernel SW, and a first output value R₁ may be generated as its result.

For example, as described above, the switch circuit 25A may connect the input buffers 210_01, 210_02, 210_03, 210_05, 210_06, 210_07, 210_09, 210_10, and 210_11 to the MUX 252 (see the first dashed lines in the switch circuit 25A of FIG. 9) so that the first, second, third, fifth, sixth, seventh, ninth, tenth, and eleventh input values I₁, I₂, I₃, I₅, I₆, I₇, I₉, I₁₀, and I₁₁ are provided to the MUX 252. The MUX 252 may select, based on the sparse index SPI, one of the input values from the connected input buffers and provide it to the MAC core 222.

As described with reference to FIG. 8, the sparse index SPI may correspond to the positions of the zeroth and fifth weight values W₀ and W₅. In this case, in the first convolution operation CON1, the first input value I₁ corresponds to the position of the zeroth weight value W₀, and the seventh input value I₇ corresponds to the position of the fifth weight value W₅. As in the zeroth convolution operation CON0, the MUX 252 sequentially transmits the first and seventh input values I₁ and I₇ to the MAC core 222, and the MAC core 222 performs the first convolution operation CON1 on the first and seventh input values I₁ and I₇ based on the sparse weight kernel SW. The first output value R₁ is generated as the result of the first convolution operation CON1 and is provided to the output buffer 230_01.

As with the zeroth and first convolution operations CON0 and CON1, the MAC cores 223 and 224 may perform the second and third convolution operations CON2 and CON3. Based on the sparse weight kernel SW, the switch circuit 25A may connect the input buffers 210_04, 210_05, 210_06, 210_08, 210_09, 210_10, 210_12, 210_13, and 210_14 to the MUX 253 (see the second dashed lines in the switch circuit 25A of FIG. 9) so that the fourth, fifth, sixth, eighth, ninth, tenth, twelfth, thirteenth, and fourteenth input values I₄, I₅, I₆, I₈, I₉, I₁₀, I₁₂, I₁₃, and I₁₄ are provided to the MUX 253. Based on the sparse weight kernel SW, the switch circuit 25A may also connect the input buffers 210_05, 210_06, 210_07, 210_09, 210_10, 210_11, 210_13, 210_14, and 210_15 to the MUX 254 (see the dotted lines in the switch circuit 25A of FIG. 9) so that the fifth, sixth, seventh, ninth, tenth, eleventh, thirteenth, fourteenth, and fifteenth input values I₅, I₆, I₇, I₉, I₁₀, I₁₁, I₁₃, I₁₄, and I₁₅ are provided to the MUX 254.

For the second convolution operation CON2, the MUX 253 sequentially outputs the fourth and tenth input values I₄ and I₁₀ based on the sparse index SPI, and the MAC core 223 performs the second convolution operation CON2 on the fourth and tenth input values I₄ and I₁₀ based on the sparse weight kernel SW. The second output value R₂, which is the result of the second convolution operation CON2, may be stored in the output buffer 230_02.

For the third convolution operation CON3, the MUX 254 sequentially outputs the fifth and eleventh input values I₅ and I₁₁ based on the sparse index SPI, and the MAC core 224 performs the third convolution operation CON3 on the fifth and eleventh input values I₅ and I₁₁ based on the sparse weight kernel SW. The third output value R₃, which is the result of the third convolution operation CON3, may be stored in the output buffer 230_03.

In the embodiment above, the zeroth to third convolution operations CON0 to CON3 were described separately for convenience and clarity, but the scope of the present invention is not limited thereto, and the zeroth to third convolution operations CON0 to CON3 may be performed in parallel with one another. For example, the input values I₀ to I₁₅ of the input tile Din_T are loaded into the input buffers 210_00 to 210_15, and the switch circuit 25A may configure the connections between the input buffers 210_00 to 210_15 and the MUXs 251 to 254 based on the sparse weight kernel SW, as described above. The MUXs 251 to 254 may then respectively output the zeroth, first, fourth, and fifth input values I₀, I₁, I₄, and I₅, which correspond to the position of the zeroth weight value W₀, as a first data set D1. The MAC cores 221 to 224 may each perform a convolution operation based on the zeroth, first, fourth, and fifth input values I₀, I₁, I₄, and I₅ and the sparse weight kernel SW. Next, the MUXs 251 to 254 may respectively output the sixth, seventh, tenth, and eleventh input values I₆, I₇, I₁₀, and I₁₁, which correspond to the position of the fifth weight value W₅, as a second data set D2. The MAC cores 221 to 224 may each perform a convolution operation based on the sixth, seventh, tenth, and eleventh input values I₆, I₇, I₁₀, and I₁₁ and the sparse weight kernel SW.
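The two data sets can be derived mechanically from the window origins of CON0 to CON3 and the sparse index, as the Python sketch below shows. The constant names and the row-major indexing are assumptions made for the illustration, and each list entry is the index n of the input value Iₙ that one MUX outputs in that step.

```python
TW = 4                                             # tile width assumed in FIG. 8
SPI = [(0, 0), (1, 2)]                             # (row, col) positions of W0 and W5
WINDOW_ORIGINS = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (top, left) of CON0..CON3


def data_set(spi_position):
    """Input indices that the four MUXs output together for one non-zero weight position."""
    r, c = spi_position
    return [(top + r) * TW + (left + c) for top, left in WINDOW_ORIGINS]


D1 = data_set(SPI[0])
D2 = data_set(SPI[1])
print(D1)  # [0, 1, 4, 5]
print(D2)  # [6, 7, 10, 11]
```

In this model each non-zero weight costs one parallel step across all four MAC cores, so the whole tile is processed in two steps.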

That is, based on the sparse index SPI, the data selector 250 outputs, for each of a plurality of kernel regions, the input value that corresponds to the position of one weight value, and the MAC operator 220 performs the convolution operation on the received input values based on the sparse weight kernel SW. Because the data selector 250 outputs only the input data corresponding to the positions of non-zero weight values (in other words, it does not output the input values corresponding to the positions of '0' weight values), the convolution operations corresponding to '0' weight values can be omitted. The more '0' values the weight kernel contains, the greater the reduction in convolution operations, and the overall performance of the CNN system can improve accordingly.
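To put a number on the reduction for the example of FIG. 8, the Python sketch below counts multiplications with and without zero-skipping. It is a back-of-the-envelope count rather than a performance measurement, the kernel values are made up, and `multiply_counts` is a hypothetical helper.

```python
def multiply_counts(kernel, num_windows):
    """Return (dense, sparse) multiplication counts for one kernel applied over num_windows positions."""
    total_weights = sum(len(row) for row in kernel)
    nonzero_weights = sum(1 for row in kernel for w in row if w != 0)
    return total_weights * num_windows, nonzero_weights * num_windows


# 3x3 kernel with only W0 and W5 non-zero (values are illustrative).
SW = [[2, 0, 0],
      [0, 0, 3],
      [0, 0, 0]]

dense, sparse = multiply_counts(SW, num_windows=4)  # CON0..CON3
print(dense, sparse)  # 36 8
```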

The embodiments of the present invention described above show the operations performed in one convolutional layer. However, the scope of the present invention is not limited thereto, and the CNN system according to the present invention may repeat the operations or the convolutional layer according to the embodiments described above.

FIG. 10 is a flowchart briefly showing the operation of the convolutional neural network system according to the present invention. Referring to FIGS. 6, 7, and 10, in step S110, the CNN system 200 may store an input tile. For example, as described above, the input buffer device 210 of the CNN system 200 may store the input tile Din_T, which is a part of the input data Din, from the external memory 201.

In step S120, the CNN system 200 may connect the input values of the input tile to the plurality of MUXs 251 to 25i. For example, as described with reference to FIG. 7, the switch circuit 25A of the CNN system 200 may connect the input values of the input tile Din_T to the plurality of MUXs 251 to 25i based on the sparse weight kernel SW.

In step S130, the CNN system 200 may select at least one of the connected input values based on the sparse index. For example, as described with reference to FIG. 7, each of the plurality of MUXs 251 to 25i may select the input values corresponding to the positions of the sparse weights based on the sparse index SPI. Input values that do not correspond to the position of a sparse weight (that is, input values that correspond to the position of a '0' weight) are not selected.

In step S140, the CNN system 200 may perform a convolution operation on the at least one input value using the sparse weight kernel. For example, as described with reference to FIG. 7, each of the plurality of MAC cores 221 to 22i of the MAC operator 220 may use the sparse weight kernel to perform a convolution operation on the input value output from the corresponding one of the plurality of MUXs 251 to 25i.

In step S150, the CNN system 200 may store and accumulate the results of the convolution operation. For example, as described with reference to FIG. 7, the output buffer device 230 may store the operation results from the MAC operator 220.
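As a purely illustrative software analogy of steps S140 and S150, a single MAC core may be modeled as a multiplier, an adder, and a register that holds the result of the previous addition. The class name `MacCore` and its methods are hypothetical, and the sketch makes no claim about bit widths, pipelining, or timing of the actual hardware.

```python
class MacCore:
    """Software model of one multiply-accumulate core (hypothetical names)."""

    def __init__(self):
        self.register = 0  # holds the result of the previous addition

    def step(self, input_value, weight):
        product = input_value * weight            # multiplier
        self.register = self.register + product   # adder, then register update
        return self.register

    def reset(self):
        self.register = 0


# Feed the core only the input values selected for the non-zero weights.
selected_inputs = [11, 16, 18]
sparse_weights = [2, 1, 3]

core = MacCore()
for x, w in zip(selected_inputs, sparse_weights):
    core.step(x, w)

result = core.register  # 11*2 + 16*1 + 18*3 = 92
```

Because only three (input, weight) pairs are presented, the core in this model performs three multiplications instead of the nine that a full 3×3 kernel would require.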

For example, when a plurality of sparse weight kernels are used, steps S130 and S140 may be repeated for each of the plurality of sparse weight kernels. The results of the repeated operations may be accumulated in the output buffer device 230.

In step S160, the CNN system 200 may output the accumulated results of the convolution operation as an output tile. For example, as described with reference to FIG. 6, when all convolution operations on the input tile Din_T have been performed, the output buffer device 230 may accumulate the operation results and provide them to the external memory 201 as an output tile Dout_T.
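The flow of steps S110 to S160 for one input tile may be sketched in software as follows. This is a simplified model under stated assumptions rather than the disclosed implementation: it uses a single sparse weight kernel, stride 1, no padding, and the sliding-window sum without kernel flipping that is conventional for CNNs. The function names are hypothetical, and the dense version is included only to check that skipping '0' weights leaves the result unchanged.

```python
def sparse_conv_tile(tile, kernel):
    """Convolve one input tile with one K x K sparse kernel,
    visiting only the positions of the non-zero weights."""
    k = len(kernel)
    # Sparse index modeled as (row, col, weight) for each non-zero weight.
    nonzero = [(r, c, kernel[r][c])
               for r in range(k) for c in range(k) if kernel[r][c] != 0]
    out_h = len(tile) - k + 1
    out_w = len(tile[0]) - k + 1
    output_tile = [[0] * out_w for _ in range(out_h)]  # models the output buffer
    for i in range(out_h):
        for j in range(out_w):
            acc = 0
            for r, c, w in nonzero:             # S130: select inputs by sparse index
                acc += tile[i + r][j + c] * w   # S140: multiply-accumulate
            output_tile[i][j] = acc             # S150: store the result
    return output_tile                          # S160: output tile


def dense_conv_tile(tile, kernel):
    """Reference version that also multiplies by the '0' weights."""
    k = len(kernel)
    out_h = len(tile) - k + 1
    out_w = len(tile[0]) - k + 1
    return [[sum(tile[i + r][j + c] * kernel[r][c]
                 for r in range(k) for c in range(k))
             for j in range(out_w)]
            for i in range(out_h)]


input_tile = [[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12],
              [13, 14, 15, 16]]   # S110: stored input tile

sparse_kernel = [[0, 2, 0],
                 [0, 0, 0],
                 [1, 0, 3]]

output_tile = sparse_conv_tile(input_tile, sparse_kernel)
# output_tile == [[46, 52], [70, 76]], identical to the dense result
```

In this example, each output value is computed with three multiplications rather than nine, while the output tile matches the dense computation exactly.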

For example, the CNN system 200 may perform the operations described above on each of the input tiles of the input data Din, and may thereby provide a plurality of output tiles to the external memory 201. The final output data Dout may be generated by combining or accumulating the plurality of output tiles.

As described above, the CNN system according to the present invention can reduce the number or size of the parameters required for computation through neural network compression, and the amount of computation required can be reduced accordingly. In doing so, the CNN system according to the present invention can simplify the hardware configuration by using the sparse index associated with the weights. Because hardware generally performs better and is simpler to construct when it is implemented as a regular array that operates repetitively, the CNN system according to the present invention can operate its hardware engine effectively while maintaining the regularity of the hardware arrangement.

The foregoing describes specific embodiments for carrying out the present invention. The present invention includes not only the embodiments described above but also embodiments that can be obtained through simple or straightforward design changes. The present invention also includes techniques that can be readily modified and practiced using the embodiments. Accordingly, the scope of the present invention should not be limited to the embodiments described above, but should be defined by the claims that follow and by their equivalents.

Claims

A convolutional neural network system comprising:
a data selector configured to output, among input values of input data, an input value corresponding to a position of a sparse weight, based on a sparse index indicating a position of a non-zero value in a sparse weight kernel; and
a multiply-accumulate (MAC) operator configured to perform a convolution operation on the input value output from the data selector using the sparse weight kernel,
wherein the sparse weight kernel includes at least one weight value of '0',
wherein the data selector includes:
a switch circuit; and
a plurality of multiplexers (MUXs),
wherein the switch circuit is configured to provide each of the input values to the plurality of MUXs based on the sparse weight kernel,
wherein each of the plurality of MUXs is configured to select and output, based on the sparse index, an input value corresponding to the position of the sparse weight among the input values provided by the switch circuit, and
wherein the MAC operator includes a plurality of MAC cores configured to receive the input values output from the respective MUXs and to respectively perform a convolution operation on the received input values based on the sparse weight kernel.

The convolutional neural network system of claim 1,
wherein the data selector is configured not to output an input value corresponding to a position of a '0' value in the sparse weight kernel among the input values.

The convolutional neural network system of claim 1, further comprising:
an input buffer device configured to store an input tile that is a part of the input data from an external memory; and
an output buffer device configured to store a result value of the convolution operation from the MAC operator and to provide the stored result value to the external memory.

The convolutional neural network system of claim 1, further comprising:
a weight kernel buffer device configured to receive the sparse weight kernel from an external memory, provide the received sparse weight kernel to the MAC operator, and provide the sparse index of the sparse weight kernel to the data selector.

(Deleted)

The convolutional neural network system of claim 1,
wherein each of the plurality of MAC cores includes:
a multiplier configured to perform a multiplication operation on the input value and the sparse weight;
an adder configured to perform an addition operation on a result of the multiplication operation and a result of a previous addition operation; and
a register configured to store a result of the addition operation.

The convolutional neural network system of claim 1,
wherein the sparse weight kernel is a weight kernel converted from a full weight kernel through neural network compression, and
wherein the full weight kernel is composed of weight values other than '0'.

The convolutional neural network system of claim 8,
wherein the neural network compression is performed based on at least one of a parameter removal technique, a weight sharing technique, and a parameter quantization technique for the full weight kernel.

A convolutional neural network system comprising:
an input buffer device configured to receive an input tile including a plurality of input values from an external memory and to store the plurality of input values of the received input tile;
a data selector configured to output at least one of the plurality of input values from the input buffer device based on a sparse index indicating a location of a sparse weight other than '0' in a sparse weight kernel;
a multiply-accumulate (MAC) operator configured to perform a convolution operation based on the sparse weight and the at least one input value output from the data selector; and
an output buffer device configured to store a result value of the convolution operation from the MAC operator and to provide the stored result value to the external memory as an output tile,
wherein the data selector includes:
a switch circuit; and
a plurality of multiplexers (MUXs),
wherein the switch circuit is configured to connect each of the plurality of input values to each of the plurality of MUXs based on the size of the input tile and the sparse weight kernel, and
wherein each of the plurality of MUXs is configured to select and output, based on the sparse index, the at least one input value corresponding to the location of the sparse weight among the connected input values.

(Deleted)

The convolutional neural network system of claim 10,
wherein each of the plurality of MUXs does not output an input value corresponding to a position of a weight of '0' in the sparse weight kernel.

The convolutional neural network system of claim 10,
wherein the at least one input value from each of the plurality of MUXs is an input value corresponding to the location of the sparse weight.

The convolutional neural network system of claim 10,
wherein, when the sparse weight kernel has a size of K×K (where K is a natural number), the switch circuit is configured to connect 2K input values to each of the plurality of MUXs.

The convolutional neural network system of claim 10,
wherein the MAC operator includes a plurality of MAC cores, each configured to perform the convolution operation based on the sparse weight kernel and the at least one input value from each of the plurality of MUXs.

(Deleted)