KR102336295B1

KR102336295B1 - Convolutional neural network system using adaptive pruning and weight sharing and operation method thereof

Info

Publication number: KR102336295B1
Application number: KR1020170027951A
Authority: KR
Inventors: 김진규; 이주현
Original assignee: 한국전자통신연구원
Priority date: 2016-10-04
Filing date: 2017-03-03
Publication date: 2021-12-09
Also published as: KR20180037558A

Abstract

The present invention relates to a neural network system that can reduce both the number of parameters and the amount of computation, and to an operating method thereof. The method of operating a convolutional neural network system of the present invention includes: performing learning on weights between neural network nodes using input data; an adaptive parameter removal step of removing, from among the weights, weights whose magnitude is smaller than a threshold and then performing learning using the input data; and an adaptive weight sharing step of mapping the weights that survive the adaptive parameter removal step to a plurality of representative values.

Description

Convolutional neural network system using adaptive pruning and weight sharing and method of operation thereof

The present invention relates to a neural network system and, more particularly, to a convolutional neural network system capable of adaptively reducing the number of parameters, and to an operating method thereof.

Recently, the convolutional neural network (hereinafter, CNN), one of the deep neural network techniques, has been actively used for image recognition. Neural network structures show excellent performance in various recognition fields such as object recognition and handwriting recognition. In particular, CNNs are highly effective for object recognition.

A convolutional neural network uses a very large number of parameters and has a very large number of connections between nodes, so the memory required for its operations must be large. In addition, because a convolutional neural network in practice requires high memory bandwidth, it is not easy to implement in an embedded or mobile system. Furthermore, because a convolutional neural network requires a large amount of computation for fast processing, its internal arithmetic units become large.

Therefore, to reduce the computational complexity of such neural network algorithms and shorten recognition time, there is an urgent need for a method of reducing the number of parameters used in a neural network system.

An object of the present invention is to provide a convolutional neural network system capable of reducing the number of parameters used in a convolutional neural network, and an operating method thereof.

A method of operating a convolutional neural network system according to an embodiment of the present invention includes: performing learning on weights between neural network nodes using input data; an adaptive parameter removal step of removing, from among the weights, weights whose magnitude is smaller than a threshold and then performing learning using the input data; and an adaptive weight sharing step of mapping the weights that survive the adaptive parameter removal step to a plurality of representative values.

A convolutional neural network system according to an embodiment of the present invention includes: an input buffer that buffers input data; an operation unit that trains parameters between a plurality of neural network nodes using the input data; an output buffer that stores and updates the learning result of the operation unit; a parameter buffer that delivers the parameters between the plurality of neural network nodes to the operation unit and updates the parameters according to the result of the learning; and a control unit that controls the parameter buffer to remove, from among the weights between the neural network nodes, weights whose magnitude is smaller than a threshold, and that maps the surviving weights to at least one representative value.

According to embodiments of the present invention, the neural network system can use fewer parameters and still provide output with the same recognition rate, without degrading recognition accuracy. In addition, because a neural network system using the adaptive parameter removal technique and the adaptive representative weight sharing technique of the present invention can use parameters compressed by a factor of several hundred or more, object recognition using a deep learning network becomes possible even on a mobile terminal. Furthermore, the neural network system of the present invention is very advantageous in terms of energy consumed per recognition, so the power required to run the convolutional neural network system can be dramatically reduced.

FIG. 1 is a block diagram showing a convolutional neural network system according to an embodiment of the present invention.
FIGS. 2A and 2B are diagrams showing the computational procedure and the number of parameters of an exemplary convolutional neural network.
FIG. 3 is a flowchart briefly illustrating a method of operating a convolutional neural network for reducing parameters according to an embodiment of the present invention.
FIG. 4 is a diagram schematically illustrating the method of reducing the parameters of the neural network of the present invention described in FIG. 3.
FIG. 5 is a flowchart briefly illustrating an adaptive parameter removal technique according to an embodiment of the present invention.
FIG. 6 is a diagram showing, stage by stage, the probability distribution of the parameters when the adaptive parameter removal technique of FIG. 5 is applied.
FIG. 7 is a diagram illustrating an adaptive weight sharing technique according to an embodiment of the present invention.
FIG. 8 is a table showing, by way of example, the effects of the present invention.

In general, a convolution operation is an operation for detecting a correlation between two functions. The term 'convolutional neural network (CNN)' may refer collectively to a process or system that performs a convolution operation with a kernel indicating a specific feature and repeats the operation on its results to determine a pattern in an image.

Hereinafter, embodiments of the present invention are described clearly and in detail, to the extent that those of ordinary skill in the art can easily practice the present invention.

FIG. 1 is a block diagram showing a convolutional neural network system according to an embodiment of the present invention. Referring to FIG. 1, the convolutional neural network system 100 may process an input image provided from the external memory 120 to generate an output value.

The input image may be a still image or a moving image provided through an image sensor. The input image may also be an image delivered through a wired or wireless communication means. The input image may represent a two-dimensional array of digitized image data. The input images may also be sample images provided for training the convolutional neural network system 100. The output value is the result of the convolutional neural network system 100 processing the input image, that is, the judgment result for the image input during a learning operation or an estimation operation. The output value may be a pattern or identification information that the convolutional neural network system 100 detects as being contained in the input image.

The convolutional neural network system 100 may include an input buffer 110, an operation unit 130, a parameter buffer 150, an output buffer 170, and a control unit 190.

The data values of the input image are loaded into the input buffer 110. The size of the input buffer 110 may vary according to the size of the kernel used for the convolution operation. For example, when the kernel size is K×K, the input buffer 110 must be loaded with enough input data for the operation unit 130 to sequentially perform the convolution operation (or kerneling) with the kernel. Loading of input data into the input buffer 110 may be controlled by the control unit 190.

The operation unit 130 performs convolution operations, pooling operations, and the like using the input buffer 110, the parameter buffer 150, and the output buffer 170. For example, the operation unit 130 may perform kerneling, which repeatedly multiplies the input image by a kernel and accumulates the products. The operation unit 130 may include parallel processing cores for processing a plurality of kerneling or pooling operations in parallel.

The kernel may be provided from, for example, the input buffer 110 or the parameter buffer 150. The process of multiplying the kernel by the data of the input image at each overlapping position and summing the results is hereinafter referred to as kerneling. Each kernel may be regarded as a specific feature identifier. Kerneling is performed on the input image with kernels corresponding to various feature identifiers. Kerneling with all of the kernels may be performed in a convolution layer, and as a result, feature maps corresponding to a plurality of channels may be generated.
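The kerneling operation described above can be sketched in plain Python. This is a minimal illustration, not the patented hardware implementation: the function name `kerneling`, the stride of 1, and the absence of padding are assumptions made for the example.

```python
def kerneling(image, kernel):
    """Slide a square kernel over a 2-D image (stride 1, no padding).

    Hypothetical helper for illustration only; names are not from the patent.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply the overlapping entries and sum them into one feature value.
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
feature_map = kerneling(image, kernel)
print(feature_map)  # [[6, 8], [12, 14]]
```

Running the same function once per kernel would yield one feature map per output channel, which matches the multi-channel result described in the text.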

The operation unit 130 may down-sample the feature maps generated by the convolution layer. Because the feature maps generated by the convolution operation are relatively large, the operation unit 130 may perform pooling to reduce their size. The result of each kerneling or pooling operation is stored in the output buffer 170 and may be updated each time the convolution loop count increases and each time a pooling operation occurs.

The parameter buffer 150 provides the parameters needed for the kerneling, bias addition, activation (ReLU), pooling, and other operations performed by the operation unit 130. Parameters learned in the learning stage may also be stored in the parameter buffer 150.

The results of the kerneling or pooling executed by the operation unit 130 are loaded into the output buffer 170. The results loaded into the output buffer 170 are updated according to the result of each convolution loop executed with the plurality of kernels.

The control unit 190 may control the operation unit 130 to perform the convolution operation, pooling operation, activation operation, and the like according to an embodiment of the present invention. The control unit 190 may perform a convolution operation using an input image or a feature map and a kernel. The control unit 190 may control the operation unit 130 to perform an adaptive parameter removal operation that removes low-weight parameters during a learning operation or an actual runtime operation. In addition, among the weights that survive the adaptive parameter removal operation, the control unit 190 may map parameters with the same or similar values within each layer to a representative parameter. When identical or similar parameters in each layer are shared as representative parameters in this way, the bandwidth required for data exchange with the external memory 120 can be greatly reduced.
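To make the bandwidth argument concrete, the following sketch compares the storage of a layer's weights with and without sharing. Everything numeric here is an assumption for illustration: the source does not state a bit width, an index encoding, or the number of representative values, so 32-bit values, a fixed-width index per weight, and 16 representatives are hypothetical choices.

```python
def shared_storage_bits(num_weights, num_representatives, value_bits=32):
    """Bits needed when each weight is stored as an index into a small table.

    Assumed encoding (not from the patent): one fixed-width index per weight
    plus a table holding each representative value once.
    """
    index_bits = max(1, (num_representatives - 1).bit_length())
    return num_weights * index_bits + num_representatives * value_bits


num_weights = 25000                      # e.g. a layer with 25,000 weights
dense_bits = num_weights * 32            # every weight stored as a 32-bit value
shared_bits = shared_storage_bits(num_weights, 16)
print(dense_bits, shared_bits)           # 800000 100512
```

Under these assumptions the shared form needs roughly one eighth of the bits, before any additional savings from removed weights.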

The configuration of the convolutional neural network system 100 of the present invention has been described above by way of example. Through the adaptive parameter removal operation and the adaptive parameter sharing operation described above, the number of parameters that the convolutional neural network system 100 must manage can be dramatically reduced. As the number of parameters decreases, the memory size and memory channel bandwidth required to build the convolutional neural network system 100 can be reduced. The reduced memory size and channel bandwidth, in turn, make a hardware implementation of a convolutional neural network on a mobile device more feasible.

FIGS. 2A and 2B are diagrams showing the computational procedure and the number of parameters of an exemplary convolutional neural network. FIG. 2A shows the layers of a convolutional neural network for processing an input image. FIG. 2B is a table showing the number of parameters used in each layer shown in FIG. 2A.

Referring to FIG. 2A, the convolution, pooling, and activation operations performed during learning or object recognition require a very large number of parameters to be input, newly generated, and updated. The input image 210 is processed by a first convolution layer (conv1) and a first pooling layer (pool1) that down-samples the result. When the input image 210 is provided, the first convolution layer (conv1), which performs a convolution operation with the kernel 215, is applied first. That is, the data of the input image 210 that overlaps the kernel 215 is multiplied by the data defined in the kernel 215. All of the products are then summed into a single feature value, which forms one point of the first feature map 220. This kerneling operation is performed repeatedly while the kernel 215 is shifted sequentially.

The kerneling operation on one input image 210 is performed for a plurality of kernels. Applying the first convolution layer (conv1) produces the first feature map 220 in the form of an array for each of a plurality of channels. For example, using four kernels produces a first feature map 220 consisting of four arrays, or channels. However, when the input image 210 is a three-dimensional image, the number of feature maps increases sharply, and the depth, which is the number of iterations of the convolution loop, may also increase sharply.

Next, when execution of the first convolution layer (conv1) is complete, down-sampling is performed to reduce the size of the first feature map 220. Depending on the number of kernels and the size of the input image 210, the data of the first feature map 220 may be large enough to burden processing. Therefore, the first pooling layer (pool1) performs down-sampling (or sub-sampling) to reduce the size of the first feature map 220 within a range that does not significantly affect the operation result. Pooling is the typical down-sampling method. A down-sampling filter is slid over the first feature map 220 with a predetermined stride, and the maximum or average value in each region is selected. Selecting the maximum value is called max pooling, and outputting the average value is called average pooling. The pooling layer (pool1) turns the first feature map 220 into a second feature map 230 of reduced size.
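Max pooling and average pooling as described above might look like the following sketch. The function name `pool2d`, the 2×2 window, and the stride of 2 are assumptions for the example; the source only says that a filter slides with a predetermined stride.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Down-sample a 2-D feature map with max or average pooling.

    Hypothetical helper; window size and stride defaults are illustrative.
    """
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            # Collect the values under the pooling window.
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


fm = [[1, 3, 2, 4],
      [5, 7, 6, 8],
      [9, 2, 1, 0],
      [3, 4, 5, 6]]
print(pool2d(fm, mode="max"))  # [[7, 8], [9, 6]]
print(pool2d(fm, mode="avg"))  # [[4.0, 5.0], [4.5, 3.0]]
```

In both modes the 4×4 map shrinks to 2×2 while the number of channels (here one) is unchanged.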

The convolution layer, in which the convolution operation is performed, and the pooling layer, in which the down-sampling operation is performed, may be repeated as needed. That is, as illustrated, a second convolution layer (conv2) and a second pooling layer (pool2) may be applied. The second convolution layer (conv2) generates a third feature map 240, and the second pooling layer (pool2) generates a fourth feature map 250. The fourth feature map 250 is then processed through the fully connected network operations (ip1, ip2) and the activation layer (Relu) to produce the fully connected layers 260 and 270 and the output layer 280. No kernel is used in the fully connected network operations (ip1, ip2) or the activation layer (Relu). Although not illustrated, a bias addition or an activation operation may of course be added between a convolution layer and a pooling layer.

Referring to FIGS. 2A and 2B, when an input image of 28×28 pixels is provided, the first convolution layer (conv1) performs a convolution operation using a 5×5 kernel 215. Because the convolution is performed without padding at the edges of the input image, a first feature map 220 with 20 output channels of size 24×24 is output. The number of output channels, 20, is determined by the number of kernels 215 used in the first convolution layer (conv1). A bias is a value added per channel, so the number of biases corresponds to the number of channels.
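The 24×24 output size follows from the standard size rule for a convolution without padding. The helper below is a small sketch of that rule; its name and the stride parameter are assumptions, and only the 28, 5, and 24 values come from the text.

```python
def valid_conv_output_size(input_size, kernel_size, stride=1):
    """Output width (or height) of a convolution with no padding."""
    return (input_size - kernel_size) // stride + 1


conv1_out = valid_conv_output_size(28, 5)
print(conv1_out)  # 24
```

With a 28-pixel side and a 5-pixel kernel, the kernel fits in 28 - 5 + 1 = 24 positions along each axis.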

Under these conditions, the number of parameters (that is, weights) used in the first convolution layer (conv1) is the product of the number of output channels (20), the number of input channels (1), and the kernel size (5×5), which is 500. The number of connections in the first convolution layer (conv1) is the size of the output first feature map (24×24) multiplied by the number of parameters (500+20), which is 299,520. The first pooling layer (pool1) adjusts the width and height of each channel in the spatial domain while keeping the number of channels unchanged. The pooling operation has the effect of approximating the characteristic data of the image in the spatial domain.

The second convolution layer (conv2) and the second pooling layer (pool2) perform the same operations as the first convolution layer (conv1) and the first pooling layer (pool1), differing only in the number of channels and the kernel size. The number of parameters (that is, weights) used in the second convolution layer (conv2) is the product of the number of output channels (50), the number of input channels (20), and the kernel size (5×5), which is 25,000. The number of connections in the second convolution layer (conv2) is the size of the output third feature map 240 (8×8) multiplied by the number of parameters (25,000+50), which is 1,603,200.

The fully connected layers (ip1, ip2) perform fully connected network (FCN) operations. No kernel is used in a fully connected network operation. Every input node is connected to every output node. Accordingly, the number of parameters in the first fully connected layer (ip1) is quite large (400,500).
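The counts quoted for conv1, conv2, and ip1 can be reproduced with a few lines of arithmetic. The helper names are hypothetical. The breakdown of ip1 into 800 inputs and 500 outputs is an assumption (a 4×4×50 pooled feature map feeding 500 nodes, plus 500 biases) that happens to match the stated total of 400,500; the source gives only the total.

```python
def conv_weights(out_channels, in_channels, kernel_size):
    """Number of weights in a convolution layer with square kernels."""
    return out_channels * in_channels * kernel_size * kernel_size


def conv_connections(out_size, weights, biases):
    """Connections = output feature-map area x (weights + biases)."""
    return out_size * out_size * (weights + biases)


conv1_w = conv_weights(20, 1, 5)
conv1_c = conv_connections(24, conv1_w, 20)
conv2_w = conv_weights(50, 20, 5)
conv2_c = conv_connections(8, conv2_w, 50)

# Assumption: ip1 connects 4*4*50 = 800 inputs to 500 outputs, plus 500 biases.
ip1_params = 800 * 500 + 500

print(conv1_w, conv1_c)  # 500 299520
print(conv2_w, conv2_c)  # 25000 1603200
print(ip1_params)        # 400500
```

The fully connected layer dominates the parameter count even though the convolution layers dominate the connection count, which is why reducing weights pays off in both memory and computation.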

As seen in FIGS. 2A and 2B, reducing the number of parameters used naturally reduces the number of connections as well, and thus the amount of computation. In addition, parameters can be divided into weights and biases. Because the number of biases is relatively small, a high compression effect can be obtained by reducing the weights.

FIG. 3 is a flowchart briefly illustrating a method of operating a convolutional neural network for reducing parameters according to an embodiment of the present invention. Referring to FIG. 3, the operating method of the present invention can provide a large parameter reduction by combining an operation that removes low-weight parameters with a weight sharing technique.

In step S110, the original neural network is trained on the input. That is, the neural network is trained using the inputs while all nodes are present. This yields learned parameters for every connection between nodes. The distribution of the parameters learned in this state resembles a normal distribution.

In step S120, the adaptive parameter removal technique is applied to the parameters of the trained neural network. The adaptive parameter removal technique consists of three main stages. In the first stage, an initial threshold is calculated for each layer of the neural network. In the second stage, training is repeated starting from the initial threshold calculated in the first stage, and parameters are removed a few at a time. As parameters continue to be removed through repeated training, a point is reached at which no parameter falls below the threshold any longer, and the procedure moves to the third stage. In the third stage, the threshold is raised to further increase the compression efficiency of the neural network. When the second and third stages are applied repeatedly, parameters with low weights are naturally removed as training proceeds. In the end, only the parameters that the neural network truly needs remain.
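The three stages above can be sketched as a loop over one layer's weights. This is an illustration under stated assumptions, not the procedure of FIG. 5: the initial threshold is passed in rather than computed, the threshold is raised by a fixed `step`, the loop stops at a hypothetical `max_threshold`, and `retrain` is a caller-supplied stand-in for real training.

```python
def adaptive_prune(weights, retrain, initial_threshold, step, max_threshold):
    """Prune one layer's weights with a rising threshold (illustrative only).

    retrain(weights, alive) must return a new list of weights; it stands in
    for a real training pass and is an assumption of this sketch.
    """
    weights = list(weights)
    alive = [True] * len(weights)          # False marks a removed connection
    threshold = initial_threshold          # stage 1: initial threshold (given)
    while threshold <= max_threshold:
        small = [i for i, w in enumerate(weights)
                 if alive[i] and abs(w) < threshold]
        if small:
            # Stage 2: remove low-magnitude weights, then train again.
            for i in small:
                alive[i] = False
            retrained = retrain(weights, alive)
            # Removed connections stay at zero even if training changed them.
            weights = [w if keep else 0.0 for w, keep in zip(retrained, alive)]
        else:
            # Stage 3: nothing left below the threshold, so raise it.
            threshold += step
    return weights, alive


def no_op_retrain(weights, alive):
    """Placeholder for a training pass; returns the weights unchanged."""
    return list(weights)


pruned, alive = adaptive_prune([0.05, -0.15, 0.4, -0.6, 0.22],
                               no_op_retrain, 0.1, 0.1, 0.25)
print(alive)  # [False, False, True, True, True]
```

With a real training pass in place of `no_op_retrain`, surviving weights would shift between rounds, so a weight that starts above the threshold could fall below it later and be removed then.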

In step S130, the adaptive representative weight sharing technique is applied to the parameters that remain after the low weights have been removed. The adaptive representative weight sharing technique maps parameters that are identical or similar within each layer of the neural network to a parameter with a single representative value, which is then shared. The parameter sharing technique is described in detail with reference to FIG. 7.
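As a rough sketch of weight sharing, the code below maps each surviving weight in a layer to the nearest of a few representative values. How the patent selects the representatives is covered with FIG. 7 and is not shown in this passage, so evenly spaced representatives between the smallest and largest surviving weight are purely an assumption here, as are the function name and the convention that a weight of exactly 0.0 means "removed".

```python
def share_weights(weights, num_representatives):
    """Map surviving weights to their nearest representative value.

    Assumptions of this sketch: 0.0 marks a removed weight, and the
    representatives are evenly spaced over the range of surviving weights.
    """
    survivors = [w for w in weights if w != 0.0]
    if not survivors:
        return [], [None] * len(weights), list(weights)
    lo, hi = min(survivors), max(survivors)
    if num_representatives == 1 or lo == hi:
        reps = [(lo + hi) / 2]
    else:
        spacing = (hi - lo) / (num_representatives - 1)
        reps = [lo + k * spacing for k in range(num_representatives)]
    indices = []
    for w in weights:
        if w == 0.0:
            indices.append(None)  # removed weights carry no index
        else:
            indices.append(min(range(len(reps)), key=lambda k: abs(w - reps[k])))
    shared = [0.0 if i is None else reps[i] for i in indices]
    return reps, indices, shared


layer = [0.0, 0.5, -1.5, 0.75, 1.5, -0.625, 0.0]
reps, indices, shared = share_weights(layer, 4)
print(reps)     # [-1.5, -0.5, 0.5, 1.5]
print(indices)  # [None, 2, 0, 2, 3, 1, None]
print(shared)   # [0.0, 0.5, -1.5, 0.5, 1.5, -0.5, 0.0]
```

After sharing, the layer can be stored as the short `reps` table plus one small index per surviving weight instead of one full-precision value per weight.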

In step S140, the neural network whose parameters have been processed by the adaptive representative weight sharing technique is retrained. Through retraining, the representative weights of the neural network can be fine-tuned for high accuracy.

The learning process of a neural network using the adaptive parameter removal technique and the adaptive representative weight sharing technique according to an embodiment of the present invention has been described above. Using these two techniques, the number of parameters required by the neural network can be dramatically reduced. As the number of parameters decreases, the complexity and amount of computation of the neural network operations also decrease. Accordingly, the memory size and bandwidth required for neural network operations can be greatly reduced.

FIG. 4 is a diagram schematically illustrating the method of reducing the parameters of the neural network of the present invention described in FIG. 3. Referring to FIG. 4, the convolutional neural network system 100 of the present invention (see FIG. 1) may perform the adaptive parameter removal technique and the adaptive representative weight sharing technique to build a neural network with a dramatically reduced number of parameters.

(a) shows the original neural network before weight removal or weight sharing is applied. That is, training of the neural network must first be carried out with all nodes of the neural network present. The original neural network is illustrated, by way of example, with the connections among 12 nodes (N1 to N12). Between nodes N1 to N4 and nodes N5 to N8, each node has connections that can be represented by four weights. Likewise, the neural network may be configured so that connections representable by four weights per node exist between nodes N5 to N8 and nodes N9 to N12. The weights between the nodes of this original neural network are generated as trained parameters. The distribution of the parameters trained in this state is known to resemble a normal distribution.

The adaptive parameter removal technique is applied to the original neural network of (a) to remove low-valued weight parameters. This procedure is indicated by reference numeral ①. The adaptive parameter removal technique used in the present invention as the low-weight removal technique operates as follows. First, an initial threshold is generated for each layer. Iterative training is then performed starting from the generated initial threshold. In each layer, a weight that was above the initial threshold before training may fall below the initial threshold after training. A parameter whose trained weight is lower than the initial threshold is removed. A connection between nodes that has been removed is never restored, and training is repeated in that state. Through iterative training with the initial threshold applied, the reduced neural network of (b) is generated.
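The rule that a removed connection is never restored can be sketched with a per-weight mask. The following is a minimal illustration in Python, not the patented implementation: the helper names `prune` and `masked_update`, the plain gradient step, and all numeric values are assumptions introduced here.

```python
def prune(weights, mask, threshold):
    """Remove every surviving weight whose magnitude is below the threshold."""
    for i, w in enumerate(weights):
        if mask[i] and abs(w) < threshold:
            mask[i] = False      # the connection is removed for good
            weights[i] = 0.0
    return weights, mask


def masked_update(weights, grads, mask, lr=0.1):
    """One gradient step (assumed update rule) applied to surviving weights only."""
    return [w - lr * g if m else 0.0 for w, g, m in zip(weights, grads, mask)]


# Hypothetical weights of one layer and a hypothetical initial threshold.
weights = [0.8, -0.05, 0.3, -0.6]
mask = [True] * len(weights)
threshold = 0.1

weights, mask = prune(weights, mask, threshold)   # removes index 1
weights = masked_update(weights, [0.1, 5.0, 2.5, -0.1], mask)
weights, mask = prune(weights, mask, threshold)   # index 2 has drifted below 0.1
print(mask)
print(weights)
```

In this toy run, index 1 stays at zero even though it receives a large gradient, because its mask bit has been cleared, and index 2 is removed in the second pass after the update pushes it below the threshold.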

As training is repeated, a point is reached at which the weights of the connections between nodes no longer fall below the initial threshold. If training is then repeated with a raised threshold higher than the initial threshold in order to increase the compression efficiency of the neural network, only the nodes and weights that are essential to the neural network survive. Once the changed threshold is determined, re-pruning (Reprune) is performed to reduce the neural network further. In this process, low-valued weights are removed while the loop of ② retraining and ③ re-pruning is repeated.

After the low-valued weight parameters have been removed using the thresholds, the adaptive weight sharing technique using representative values is applied. That is, the weight sharing technique is applied to the nodes and weights that survived the parameter removal technique. A convolutional neural network has the property of weight sharing. Using this property, among the nodes and weights reduced by the adaptive parameter removal technique, parameters that are similar or identical to a representative value can be grouped and managed together. As an example of weight sharing, if the weight between nodes (N1, N5) and the weight between nodes (N1, N6) are similar, these connections can be mapped to the weight of a single representative value. Likewise, if the weight between nodes (N6, N9) and the weight between nodes (N9, N9) are similar, these connections can be mapped to the weight of a single representative value.

The form of the neural network to which the adaptive weight sharing technique described above has been applied is shown in (d). When the neural network generated by the adaptive weight sharing technique is processed through the ⑤ retraining procedure, the final neural network of (e) can be constructed. As a result, after the adaptive parameter removal technique is applied, nodes N7 and N10 and all weights associated with nodes N7 and N10 can be removed.

FIG. 5 is a flowchart briefly illustrating the adaptive parameter removal technique according to an embodiment of the present invention. Referring to FIG. 5, the low-valued weight parameters of the neural network can be removed by repeating a training process accompanied by the setting and adjustment of thresholds. Under the adaptive parameter removal technique, only the essential parameters that determine the characteristics of the neural network survive. Initial training provides the original neural network, in which weights exist for all nodes.

In step S210, an initial threshold is calculated for every layer of the original neural network. The layers of a neural network differ in importance. For example, the first convolutional layer, which receives the input image, is very important because the earliest feature points are extracted there. The last convolutional layer is also important because the probabilities for the feature values are calculated there. Parameters of relatively important layers must be removed with caution. Therefore, before the parameters of important layers are removed, the sensitivity of these layers needs to be examined.

The sensitivity of each layer of the neural network is examined by adjusting the threshold used for weight removal in that layer while the remaining layers are kept unchanged. For example, to calculate the initial threshold of the first convolutional layer, the threshold is raised little by little while the connections between nodes in the remaining layers are kept intact, and the accuracy is calculated as a function of the proportion of weights removed. As the threshold is raised little by little, there is an interval in which the accuracy suddenly drops sharply, and the threshold at that point is determined as the initial threshold of the first layer. The initial thresholds of the remaining layers are obtained in the same way, by varying the threshold little by little.
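The layer-by-layer sensitivity scan can be sketched as follows. This is a minimal illustration under stated assumptions: `evaluate` is a hypothetical callback that returns the accuracy of the whole network when the given layer uses the pruned weights and all other layers are left unchanged, and `step` and `max_drop` (the size of an accuracy drop treated as sharp) are illustrative parameters that the source does not specify.

```python
def find_initial_threshold(layer_weights, evaluate, step=0.05, max_drop=0.02):
    """Raise the threshold in small steps and return the first value at which
    accuracy falls by more than max_drop relative to the unpruned layer."""
    baseline = evaluate(layer_weights)
    largest = max(abs(w) for w in layer_weights)
    k = 1
    while k * step <= largest:
        threshold = k * step
        pruned = [w if abs(w) >= threshold else 0.0 for w in layer_weights]
        if baseline - evaluate(pruned) > max_drop:
            return threshold
        k += 1
    return largest  # no sharp drop was observed in the scanned range


# Toy stand-in for a real accuracy measurement (an assumption for this demo):
# accuracy stays high while at least 90% of the total weight magnitude survives.
layer = [0.01, -0.02, 0.03, 0.52, -0.7, 0.9]
total = sum(abs(w) for w in layer)


def toy_accuracy(ws):
    return 0.99 if sum(abs(w) for w in ws) >= 0.9 * total else 0.50


initial_threshold = find_initial_threshold(layer, toy_accuracy)
print(round(initial_threshold, 2))
```

The sketch returns the threshold at which the drop is first observed, following the literal wording of the description. A more conservative variant would return the last threshold before the drop.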

In step S220, iterative training is performed starting from the initial threshold determined in step S210. As training proceeds, a weight in a layer of the neural network that was above the threshold may fall below the threshold. In addition, some connections of the original neural network will have weights lower than the determined initial threshold. Parameters whose weights are lower than the initial threshold are removed in this step. A connection between nodes that has been removed is not restored, and training proceeds with the connection removed. If parameters continue to be removed through iterative training, a point is eventually reached at which no parameter lower than the threshold occurs.

In step S230, it is detected whether any of the weights between nodes generated as a result of training is lower than the initial threshold. If a weight lower than the initial threshold is no longer detected as a result of training (No direction), the procedure moves to step S240. If a weight lower than the initial threshold is still detected (Yes direction), the procedure returns to step S220, and additional training and weight removal are performed.

In step S240, when no weight lower than the initial threshold remains, the threshold is raised in order to increase compression efficiency. When the threshold is raised, the standard deviation of the parameters is calculated for each layer. The calculated per-layer standard deviation is multiplied by a fixed ratio, and the result is reflected in the raised threshold. Once the threshold has changed, retraining is performed with the raised threshold applied.
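One way to read the threshold update is as an additive step proportional to the spread of the layer's weights. The sketch below assumes the form new threshold = old threshold + ratio × standard deviation. The additive form, the use of surviving (non-zero) weights only, and the value of `ratio` are assumptions, since the description states only that the per-layer standard deviation multiplied by a fixed ratio is reflected in the raised threshold.

```python
import statistics


def raise_threshold(threshold, layer_weights, ratio=0.1):
    """Return a raised threshold based on the standard deviation of the
    surviving (non-zero) weights of one layer."""
    surviving = [w for w in layer_weights if w != 0.0]
    sigma = statistics.pstdev(surviving)   # population standard deviation
    return threshold + ratio * sigma


# Hypothetical layer in which pruned weights are stored as 0.0.
layer = [-2.0, 2.0, 0.0, -2.0, 2.0, 0.0]
new_threshold = raise_threshold(0.5, layer, ratio=0.1)
print(new_threshold)
```

Tying the step to the standard deviation keeps the increase small for layers whose weights are tightly clustered and larger for layers with a wide spread.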

In step S250, it is detected, for each layer, whether any of the parameters of the trained neural network has a weight at or below the raised threshold. If a weight lower than the raised threshold is no longer detected among the parameters of the neural network (No direction), the overall procedure of the adaptive parameter removal technique ends. If a weight lower than the raised threshold is still detected (Yes direction), the procedure returns to step S240, and additional training and weight removal are performed.
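The control flow of steps S220 to S250 can be sketched as two prune-and-retrain loops, one per threshold. This is an illustrative reading rather than the patented implementation: `retrain` is a hypothetical callback standing in for a real training pass, both thresholds are supplied by the caller, and a single raised threshold is assumed.

```python
def adaptive_prune(weights, initial_threshold, raised_threshold, retrain):
    """Prune and retrain with the initial threshold until no surviving weight
    lies below it, then repeat with the raised threshold."""
    weights = list(weights)
    mask = [True] * len(weights)
    for threshold in (initial_threshold, raised_threshold):
        while True:
            below = [i for i, w in enumerate(weights)
                     if mask[i] and abs(w) < threshold]
            if not below:                 # S230 / S250: nothing left to remove
                break
            for i in below:               # remove; never restored afterwards
                mask[i] = False
                weights[i] = 0.0
            weights = retrain(weights, mask)
    return weights, mask


# Toy retraining (an assumption): surviving weights shrink by 10% per pass.
def toy_retrain(ws, mask):
    return [0.9 * w if m else 0.0 for w, m in zip(ws, mask)]


final_weights, final_mask = adaptive_prune(
    [0.9, -0.06, 0.12, -0.5, 0.03], 0.1, 0.3, toy_retrain)
print(final_mask)
```

Each pass through the inner loop removes at least one weight, so the loop always terminates, at the latest when no weights remain.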

As described above, when training and parameter removal are repeated while the threshold is varied, parameters with low-valued weights are removed naturally, and ultimately only the parameters essential to the neural network survive. When the adaptive parameter removal technique is used, the original neural network consists of parameters that have already been trained. Accordingly, the threshold can be adjusted in small steps. In obtaining the step value, the distribution of the parameters of the corresponding layer is very important.

FIG. 6 is a diagram showing, stage by stage, the probability distribution of the parameters when the adaptive parameter removal technique of FIG. 5 is applied. Referring to FIG. 6, under the adaptive parameter removal technique of the present invention, only the weights of the neural network that lie at or above the threshold survive.

(a) shows the probability distribution of the original weights and the initial threshold. The original weights are the weights of the original neural network, in which all nodes are present, after training has been performed. The weights generated by training form a normal distribution centered on a mean of 0. It can also be seen that weights with values lower than the initial threshold account for the majority of all parameters.

(b) shows the probability distribution of the parameters after the weights with values lower than the initial threshold have been removed. That is, pruning with the initial threshold removes the parameters that fall within the interval around the mean bounded by the negative and positive initial thresholds.

(c) shows the distribution of the weights produced by additional training after the parameters with weights lower than the initial threshold have been removed. Retraining changes the sharp edge of the distribution at the initial threshold into a smooth shape. As a result of the retraining, however, parameters with weights lower than the initial threshold appear again. These parameters can be removed by repeating the loop of pruning and retraining.

(d) shows the process of deleting low-valued weight parameters using the adjusted threshold. That is, when a raised threshold at a level higher than the initial threshold is calculated and the retraining step ④ is performed, the final weight distribution of (e) is obtained.

Among all parameters (weights in particular), there are low-valued weights that do not significantly affect the neural network operation. In the probability distribution, such low-valued weight parameters are relatively numerous, and they burden the convolution, activation, and pooling operations. With the adaptive parameter removal technique of the present invention, these low-valued weight parameters can be removed at a level that has almost no effect on the performance of the neural network.

FIG. 7 is a diagram illustrating the adaptive weight sharing technique according to an embodiment of the present invention. Referring to FIG. 7, under the adaptive representative weight sharing technique, similar parameters within each layer of the neural network are mapped to, and share, a parameter having a single representative value.

When parameters with low-valued weights are removed by the adaptive parameter removal technique, the weights in each layer follow a bimodal distribution. The bimodal distribution is divided into a negative region and a positive region. In each region, representative values are set by dividing the range from the lowest value to the highest value evenly. These representative values are called centroids. The bimodal weight distribution and an exemplary distribution of centroids in each region are shown in the graph at the bottom of the figure.

The centroid values for each region are determined by the total number of representative values that is set. If N centroid values are used, N/2 centroid values are used in the negative band and N/2 centroid values are used in the positive band. The initial centroid values are placed linearly so that adjacent values are equally spaced. In the illustrated graph, four centroids (-3.5, -2.5, -1.5, -0.5) are assigned to the negative band and four centroids (0.5, 1.5, 2.5, 3.5) are assigned to the positive band.
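The linear placement of the initial centroids can be sketched as below. The sketch assumes that the outermost centroids of each band sit on the smallest and largest surviving weight of that band, that the number of centroids is even, and that both bands contain at least one weight. The function name `init_centroids` and the sample weights are hypothetical.

```python
def init_centroids(weights, n_centroids):
    """Place n_centroids/2 equally spaced centroids over the negative weights
    and n_centroids/2 over the positive weights."""
    half = n_centroids // 2
    negative = [w for w in weights if w < 0]
    positive = [w for w in weights if w > 0]

    def spaced(lo, hi, n):
        if n == 1:
            return [(lo + hi) / 2]
        gap = (hi - lo) / (n - 1)
        return [lo + i * gap for i in range(n)]

    return (spaced(min(negative), max(negative), half)
            + spaced(min(positive), max(positive), half))


# Hypothetical surviving weights spanning the bands shown in the graph.
surviving = [-3.5, -1.2, -0.5, 0.5, 2.0, 3.5]
centroids = init_centroids(surviving, 8)
print(centroids)
```

With these sample weights and N = 8, the sketch reproduces the eight centroid values listed for the illustrated graph.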

With the centroids set as described above, the weights of the exemplary real weight set 320 are approximated with reference to the centroid values. That is, each real weight is approximated by a centroid value 345 and, once approximated, can be mapped to a centroid index 340. Through this approximation to representative values, the real weight set 320 can be converted into an index map 360. The linearly placed initial centroid values are refined through the retraining process.
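The conversion of a real weight set into an index map can be sketched as a nearest-centroid lookup. The nearest-value rule is an assumption, since the description says only that each weight is approximated by a centroid value. The function name `to_index_map` and the sample weights are hypothetical and are not taken from FIG. 7.

```python
def to_index_map(weights, centroids):
    """Map each weight to the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: abs(w - centroids[k]))
            for w in weights]


centroids = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
real_weights = [2.09, -0.98, 1.48, 0.62, -3.1]   # hypothetical surviving weights
index_map = to_index_map(real_weights, centroids)
print(index_map)
```

Each entry of the resulting list is a small integer, so it can be stored in far fewer bits than the weight it replaces.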

When the centroid values are used in a deep learning recognition engine, a mapping table between the centroid indices 340 and the centroid values can be used. If the centroid mapping table of a layer is read and stored in the recognition hardware engine before the recognition operation is performed, only the index values need to be read from memory and processed thereafter.
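On the inference side, the mapping table acts as a small lookup table. The sketch below is illustrative only: `decode` and `bits_per_index` are hypothetical helper names, and the table and index values are sample data rather than values from the figures.

```python
def decode(index_map, table):
    """Recover the shared weight values from stored centroid indices."""
    return [table[i] for i in index_map]


def bits_per_index(table):
    """Smallest number of bits that can address every table entry."""
    return max(1, (len(table) - 1).bit_length())


# Per-layer mapping table, loaded once before the recognition operation.
table = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
stored_indices = [6, 3, 5, 4, 0]      # only these are fetched from memory

decoded = decode(stored_indices, table)
print(decoded, bits_per_index(table))
```

With eight table entries, each stored index fits in 3 bits, which is where the reduction in memory traffic comes from.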

The use of centroids has been described above as an example of the adaptive representative weight sharing method. However, the adaptive representative weight sharing method of the present invention is not limited to this example, and representative values may be mapped and shared in various ways.

FIG. 8 is a table showing an example of the effects of the present invention. FIG. 8 shows the result of reducing the parameters of the neural network shown in FIG. 2 (for example, a LeNet network) without loss of accuracy.

Using the adaptive parameter removal technique, it was confirmed that the number of weights among the parameters can be reduced from 430,500 in total to 12,261. This means that handwritten digit recognition is possible without affecting accuracy when 97.15% of all weights are unused and only 2.85% are used. In addition, with the adaptive representative weight sharing technique, digit recognition works without problems in the LeNet neural network even when only 8 centroid parameters are used per layer. The total number of distinct weight values actually used is only 32. Accordingly, when the 12,261 parameters are stored in memory, each can be stored as an index pointing to a representative weight rather than as a weight value.
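The figures above can be checked with a short calculation. The weight counts and centroid counts come from the description. The 32-bit width of an uncompressed weight is an assumption, and the estimate ignores whatever is needed to record the positions of the surviving weights, which the description does not quantify.

```python
import math

total_weights = 430_500       # from the description
surviving_weights = 12_261    # from the description
centroids_per_layer = 8       # from the description
distinct_values = 32          # from the description

float_bits = 32               # assumption: uncompressed weights are 32-bit floats

surviving_ratio = surviving_weights / total_weights
index_bits = math.ceil(math.log2(centroids_per_layer))
dense_bytes = total_weights * float_bits // 8
index_bytes = math.ceil(surviving_weights * index_bits / 8)
table_bytes = distinct_values * float_bits // 8

print(f"surviving weights: {surviving_ratio:.2%}")
print(f"bits per index:    {index_bits}")
print(f"dense storage:     {dense_bytes} bytes")
print(f"index storage:     {index_bytes} bytes + {table_bytes} bytes of tables")
```

Under these assumptions the weight values shrink from about 1.7 MB to under 5 KB of indices plus 128 bytes of mapping tables.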

With the adaptive parameter removal technique and the adaptive representative weight sharing technique described above, the size of the memory needed to run the neural network can be drastically reduced. The memory bandwidth requirement also drops substantially as fewer parameters are stored in memory. Furthermore, the power consumed by convolution, pooling, and activation operations is expected to fall sharply as the number of parameters decreases.

As described above, the present invention proposes a parameter compression method for reducing the amount of computation, which is the greatest problem in implementing a neural network in hardware. With the adaptive parameter removal technique and the adaptive representative weight sharing technique of the present invention, an output with the same recognition rate can be obtained using relatively few parameters, without loss of recognition accuracy. In addition, with the compression method of the present invention, the parameters can be compressed by a factor of several hundred or more, from a scale of several hundred megabytes (MByte) to a scale of several megabytes (MByte). Accordingly, object recognition using a deep learning network becomes possible even on a mobile terminal. This feature is also highly advantageous in terms of energy management.

The foregoing describes specific examples for carrying out the present invention. The present invention includes not only the embodiments described above but also embodiments that can be obtained by simple or straightforward design changes. The present invention also includes techniques that can be readily modified and practiced in the future using the embodiments described above.

Claims

A method of operating a convolutional neural network system, comprising:
performing learning on weights between neural network nodes using input data;
an adaptive parameter removal step of performing learning using the input data after removing a weight having a size smaller than a threshold value from among the weights; and
an adaptive weight sharing step of mapping the weights surviving the adaptive parameter removal step to a plurality of representative values,
wherein the adaptive parameter removal step includes:
determining an initial threshold for each layer of the neural network;
performing weight removal and learning using the initial threshold; and
performing weight removal and learning using an upward threshold having a value greater than the initial threshold.

The method of claim 1,
wherein, in the step of performing the learning, the learning is performed with all nodes of the neural network included, and learned weights are generated for the connections between all of the nodes.

delete

The method of claim 1,
wherein, in the determining of the initial threshold, a threshold is applied to each of the layers while being varied sequentially and while the connections of the other layers are maintained, and the threshold at which the accuracy falls below a reference accuracy is determined as the initial threshold of that layer.

The method of claim 1,
wherein the step of performing weight removal and learning using the initial threshold includes:
removing weights having a size smaller than the initial threshold from among the weights of each of the layers; and
performing learning on the surviving weights that remain after the weights having a size smaller than the initial threshold have been removed.

The method of claim 5,
wherein the steps of removing weights having a size smaller than the initial threshold and performing learning on the surviving weights constitute an iterative loop, and the iterative loop is repeated until the weights having a size smaller than the initial threshold have been removed.

The method of claim 1,
wherein the step of performing weight removal and learning using an upward threshold having a value greater than the initial threshold includes:
removing weights smaller than the upward threshold from among the surviving weights; and
performing learning on weights equal to or greater than the upward threshold.

The method of claim 7,
wherein the steps of removing the weights having a size smaller than the upward threshold and performing learning on the weights having a size equal to or greater than the upward threshold constitute an iterative loop, and the iterative loop is repeated until the weights smaller than the upward threshold have been removed.

The method of claim 1,
wherein, in the adaptive weight sharing step, the plurality of representative values are determined as centroid values of the surviving weights.

The method of claim 9,
wherein the centroid values are refined through re-learning of the surviving weights.

A convolutional neural network system comprising:
an input buffer for buffering input data;
an operation unit for learning parameters between a plurality of neural network nodes using the input data;
an output buffer for storing and updating the learning result of the operation unit;
a parameter buffer that transmits the parameters between the plurality of neural network nodes to the operation unit and updates the parameters according to a result of the learning; and
a control unit that controls the parameter buffer to remove weights having a size smaller than a threshold from among the weights between the neural network nodes, and maps surviving weights from among the weights to at least one representative value,
wherein the control unit performs learning on the weights between the neural network nodes using the input data, removes weights having a size smaller than a threshold from among the weights, performs re-learning, and maps the surviving weights to a plurality of representative values, and
wherein the control unit generates first surviving weights by executing a first iterative loop that performs learning after removing weights smaller than a first threshold from among the weights, and generates the surviving weights by executing a second iterative loop that uses a second threshold greater than the first threshold to remove weights smaller than the second threshold from among the first surviving weights.

delete

The system of claim 11,
wherein the control unit determines the threshold for each layer of the neural network.

delete

The system of claim 11,
wherein the representative value corresponds to a centroid.

The system of claim 11,
wherein the control unit stores a centroid index to which the representative value is mapped and a mapping table of the centroid, and exchanges the centroid index with an external memory as the surviving weight.