KR102315617B1

KR102315617B1 - Apparatus and method for neural network pruning considering structure of graphic processing device

Info

Publication number: KR102315617B1
Application number: KR1020210060989A
Authority: KR
Inventors: 양회석; 최규식
Original assignee: 아주대학교산학협력단
Priority date: 2021-05-11
Filing date: 2021-05-11
Publication date: 2021-10-20

Abstract

An objective of the present application is to provide an apparatus and method for neural network pruning that consider the structure of a graphics processing unit (GPU) and can accelerate inference while minimizing the loss of accuracy of neural network inference performed on the GPU. According to a first embodiment of the present application, the method for neural network pruning may comprise: a step of performing GEMM transformation on a weight kernel of a neural network operating on a GPU that includes a plurality of compute units; and a step of pruning the GEMM-transformed weight kernel in units of blocks, a block being the unit into which a matrix multiplication operation performed in parallel by the plurality of compute units is divided.

Description

Apparatus and method for neural network pruning considering the structure of graphic processing unit

The present application relates to a neural network pruning apparatus and method that consider the structure of a graphics processing unit.

Convolutional neural networks (CNNs) are highly effective at extracting features from input images and are therefore widely used in computer vision tasks such as image classification, segmentation, and recognition, as well as in applications of artificial neural networks (ANNs) such as natural language processing (NLP) and autonomous driving. In natural language processing and autonomous driving in particular, general-purpose computing on graphics processing units (GPGPU) is common, and it accelerates data stream processing by exploiting the massive parallelism of the GPU.

In particular, CNN performance has been maximized by building deep neural networks (DNNs) in which many convolutional layers are stacked deeply, but this improvement demands an enormous amount of computation and memory. Moreover, although convolutional layers have fewer weights than fully connected layers, computation is heavily skewed toward them, to the extent that 99% of all operations in a CNN are performed in the convolutional layers.

To overcome these limitations of CNNs, various techniques for making neural networks lightweight have been studied. Among them, pruning compresses a neural network by removing unnecessary weights inside it. The weights of a CNN are generally over-parameterized, so redundant weights exist whose removal does not significantly affect accuracy. Pruning sets these redundant weights to zero, thereby reducing the amount of computation and memory usage.

In this regard, pruning techniques are divided into unstructured and structured pruning according to the unit of weights removed. With unstructured pruning, the weights of a convolutional layer end up with irregular sparsity, which limits the performance gain obtainable on a GPU without a special computation library or a separate hardware accelerator.

Meanwhile, FIG. 1 is a conceptual diagram illustrating kernel-channel pruning, one of the conventional structured pruning techniques. Referring to FIG. 1, structured pruning removes regular regions of weights, so it can improve performance through pruning alone. However, removing weights in regular regions causes a non-negligible degradation in CNN accuracy, and on a massively parallel accelerator such as a GPU, removing some regular groups of weights may actually slow computation down.

The background art of the present application is disclosed in Korean Patent Application Laid-Open No. 10-2020-0145648.

The present application is intended to solve the problems of the prior art described above, and aims to provide a neural network pruning apparatus and method that consider the structure of a graphics processing unit and can accelerate inference while minimizing the loss of accuracy of neural network inference performed on the graphics processing unit.

However, the technical problems to be solved by the embodiments of the present application are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, a neural network pruning method considering the structure of a graphics processing unit according to a first embodiment of the present application may include: performing GEMM transformation on a weight kernel of a neural network operating on a graphics processing unit (GPU) that includes a plurality of compute units; and pruning the GEMM-transformed weight kernel in units of blocks, a block being the unit into which a matrix multiplication operation performed in parallel by the plurality of compute units is divided.

In addition, the graphics processing unit may include a global memory and a plurality of local memories, one provided for each of the plurality of compute units.

In addition, the block may correspond to a partitioned portion of the GEMM-transformed weight kernel that is copied from the global memory to the local memory in order to perform the matrix multiplication operation.

In addition, each of the plurality of compute units may perform the matrix multiplication operation on a preset number of the blocks.

In addition, the pruning step may remove a preset ratio of the blocks associated with each of the plurality of compute units.

In addition, a block removed by the pruning step may not be copied from the global memory to the local memory during inference of the neural network.
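The effect of not copying removed blocks can be sketched as a matrix multiplication that walks the weight matrix block by block and skips every pruned block. This is only an illustrative CPU emulation, not the GPU kernel of the application: the function name `block_sparse_gemm`, the `keep` mask layout, and the `copies` counter are assumptions introduced here, and the slice assigned to `local` merely stands in for a global-to-local memory copy.

```python
def block_sparse_gemm(W, X, bh, bw, keep):
    """Multiply W (M x K) by X (K x N), visiting W in bh x bw blocks.

    keep[bi][bj] is False for a pruned block (hypothetical mask layout).
    Pruned blocks are neither "copied" nor multiplied.
    Returns the product and the number of emulated block copies.
    """
    M, K, N = len(W), len(W[0]), len(X[0])
    out = [[0.0] * N for _ in range(M)]
    copies = 0
    for bi, r0 in enumerate(range(0, M, bh)):
        for bj, k0 in enumerate(range(0, K, bw)):
            if not keep[bi][bj]:
                continue  # pruned block: skip the copy and the arithmetic
            # Emulated copy of one block from "global" to "local" memory.
            local = [W[r][k0:k0 + bw] for r in range(r0, min(r0 + bh, M))]
            copies += 1
            for i, row in enumerate(local):
                for kk, w in enumerate(row):
                    for j in range(N):
                        out[r0 + i][j] += w * X[k0 + kk][j]
    return out, copies


# The right-hand block of W is already zero, so skipping it changes nothing
# except the number of block copies.
W = [[1, 2, 0, 0],
     [3, 4, 0, 0]]
X = [[1], [1], [1], [1]]
dense_out, dense_copies = block_sparse_gemm(W, X, 2, 2, [[True, True]])
sparse_out, sparse_copies = block_sparse_gemm(W, X, 2, 2, [[True, False]])
```

In this toy case both calls produce the same output, while the masked call performs one block copy instead of two, which is the saving the paragraph above describes.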

In addition, the pruning step may include calculating a weight importance for each of the blocks and determining, based on the calculated weight importance, the blocks to be removed at the preset ratio.
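The two sub-steps above (score each block, then remove a preset ratio of the blocks tied to each compute unit) might look like the following minimal sketch. Several points are assumptions and not taken from the application: the importance metric is the sum of absolute weights (L1 norm), each row band of blocks is treated as the set of blocks handled by one compute unit, and the names `block_importance` and `prune_blocks` are hypothetical.

```python
def block_importance(W, r0, c0, bh, bw):
    """Assumed importance metric: L1 norm of the block's weights."""
    return sum(abs(W[r][c])
               for r in range(r0, r0 + bh)
               for c in range(c0, c0 + bw))


def prune_blocks(W, bh, bw, ratio):
    """Zero the least important `ratio` of blocks in every row band.

    A row band of blocks is assumed here to be the group of blocks
    associated with one compute unit. Returns the pruned copy of W and
    the sorted (block_row, block_col) indices that were removed.
    """
    rows, cols = len(W), len(W[0])
    assert rows % bh == 0 and cols % bw == 0, "W must tile evenly"
    pruned = [row[:] for row in W]
    removed = []
    for r0 in range(0, rows, bh):
        starts = list(range(0, cols, bw))
        n_remove = int(len(starts) * ratio)  # same count for every band
        ranked = sorted(starts,
                        key=lambda c0: block_importance(W, r0, c0, bh, bw))
        for c0 in ranked[:n_remove]:
            removed.append((r0 // bh, c0 // bw))
            for r in range(r0, r0 + bh):
                for c in range(c0, c0 + bw):
                    pruned[r][c] = 0.0
    return pruned, sorted(removed)


W = [[1, 1, 5, 5],
     [1, 1, 5, 5],
     [9, 9, 2, 2],
     [9, 9, 2, 2]]
pruned, removed = prune_blocks(W, bh=2, bw=2, ratio=0.5)
```

Because the same number of blocks is removed from every band, each emulated compute unit is left with the same number of blocks to process.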

Meanwhile, a neural network pruning method considering the structure of a graphics processing unit according to a second embodiment of the present application may include: performing GEMM transformation on a weight kernel of a neural network operating on a graphics processing unit (GPU) that includes a plurality of compute units, each of which includes a plurality of processing elements (PEs); and pruning the GEMM-transformed weight kernel in units of fine blocks, a fine block being the unit into which a matrix multiplication operation performed in parallel by the plurality of processing elements is divided.

In addition, the graphics processing unit may include a global memory, a plurality of local memories, one provided for each of the plurality of compute units, and a plurality of private memories, one provided for each of the plurality of processing elements.

In addition, the fine block may correspond to a weight row included in a block, the block being a partitioned portion of the GEMM-transformed weight kernel that is copied from the global memory to the local memory in order to perform the matrix multiplication operation.

In addition, the pruning step may remove a preset ratio of the fine blocks associated with each of the plurality of processing elements.

In addition, as a result of the pruning step, the computational load of each of the plurality of processing elements may be made equal.

In addition, the pruning step may include calculating a weight importance for each of the fine blocks and determining, based on the calculated weight importance, the fine blocks to be removed at the preset ratio.
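Fine-block pruning as summarized above can be sketched in the same spirit: within one block, every processing element loses the same number of weight rows, so the remaining load is identical across processing elements. The mapping of consecutive rows to a processing element, the L1-norm importance, and the name `prune_fine_blocks` are assumptions made for this illustration only.

```python
def prune_fine_blocks(block, rows_per_pe, ratio):
    """Zero the least important rows (fine blocks) for each processing element.

    Assumption: consecutive groups of `rows_per_pe` rows of `block` belong
    to one processing element. Importance is taken as the L1 norm of a row.
    Returns the pruned block and the sorted indices of the removed rows.
    """
    assert len(block) % rows_per_pe == 0, "rows must divide evenly among PEs"
    n_remove = int(rows_per_pe * ratio)  # identical for every PE
    pruned = [row[:] for row in block]
    removed = []
    for start in range(0, len(block), rows_per_pe):
        rows = range(start, start + rows_per_pe)
        ranked = sorted(rows, key=lambda r: sum(abs(w) for w in block[r]))
        for r in ranked[:n_remove]:
            pruned[r] = [0.0] * len(block[r])
            removed.append(r)
    return pruned, sorted(removed)


block = [[1.0, 1.0],   # PE 0
         [3.0, 3.0],   # PE 0
         [0.5, 0.5],   # PE 1
         [2.0, 2.0]]   # PE 1
pruned_block, removed_rows = prune_fine_blocks(block, rows_per_pe=2, ratio=0.5)
```

Each emulated processing element keeps exactly one row here, which mirrors the equal-load property stated for the second embodiment.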

Meanwhile, a neural network pruning apparatus considering the structure of a graphics processing unit according to the first embodiment of the present application may include: a weight transformation unit that performs GEMM transformation on a weight kernel of a neural network operating on a graphics processing unit (GPU) that includes a plurality of compute units; and a pruning unit that prunes the GEMM-transformed weight kernel in units of blocks, a block being the unit into which a matrix multiplication operation performed in parallel by the plurality of compute units is divided.

In addition, the pruning unit may calculate a weight importance for each of the blocks and determine, based on the calculated weight importance, the blocks to be removed at the preset ratio.

Meanwhile, a neural network pruning apparatus considering the structure of a graphics processing unit according to the second embodiment of the present application may include: a weight transformation unit that performs GEMM transformation on a weight kernel of a neural network operating on a graphics processing unit (GPU) that includes a plurality of compute units, each of which includes a plurality of processing elements (PEs); and a pruning unit that prunes the GEMM-transformed weight kernel in units of fine blocks, a fine block being the unit into which a matrix multiplication operation performed in parallel by the plurality of processing elements is divided.

In addition, the pruning unit may remove a preset ratio of the fine blocks associated with each of the plurality of processing elements, so that the computational load of each of the plurality of processing elements is made equal.

In addition, the pruning unit may calculate a weight importance for each of the fine blocks and determine, based on the calculated weight importance, the fine blocks to be removed at the preset ratio.

The means for solving the problem described above are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description.

According to the means for solving the problem described above, it is possible to provide a neural network pruning apparatus and method that consider the structure of a graphics processing unit and can accelerate inference while minimizing the loss of accuracy of neural network inference performed on the graphics processing unit.

According to the means for solving the problem described above, a gain in computation speed can be obtained with a much smaller loss of accuracy than with structured weight pruning techniques.

However, the effects obtainable from the present application are not limited to those described above, and other effects may exist.

FIG. 1 is a conceptual diagram illustrating kernel-channel pruning, one of the conventional structured pruning techniques.
FIG. 2 is a schematic configuration diagram of a neural network pruning apparatus considering the structure of a graphics processing unit according to the first embodiment of the present application.
FIGS. 3 and 4 are conceptual diagrams illustrating the process of GEMM-transforming an input feature map and a weight kernel of a neural network.
FIG. 5 is a conceptual diagram for explaining the hierarchical structure of a graphics processing unit and the division units of a matrix multiplication operation using a GEMM-transformed weight kernel.
FIG. 6 is a conceptual diagram illustrating the blocking/tiling process associated with a matrix multiplication operation using a GEMM-transformed weight kernel.
FIG. 7 is a schematic configuration diagram of a neural network pruning apparatus considering the structure of a graphics processing unit according to the second embodiment of the present application.
FIG. 8 is a conceptual diagram for explaining the fine-block pruning technique performed by the neural network pruning apparatus according to the second embodiment of the present application.
FIGS. 9A to 9C compare pruning results of conventional pruning techniques with those of the pruning technique disclosed herein.
FIG. 10 is an operation flowchart of a neural network pruning method considering the structure of a graphics processing unit according to the first embodiment of the present application.
FIG. 11 is an operation flowchart of a neural network pruning method considering the structure of a graphics processing unit according to the second embodiment of the present application.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present application pertains can readily practice them. However, the present application may be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present application clearly, and similar reference numerals are assigned to similar parts throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" or "indirectly connected" with another element interposed between them.

Throughout this specification, when a member is said to be located "on", "over", "on top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member exists between the two members.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

FIG. 2 is a schematic configuration diagram of a neural network pruning apparatus considering the structure of a graphics processing unit according to the first embodiment of the present application.

Referring to FIG. 2, the neural network pruning apparatus 100 considering the structure of a graphics processing unit according to the first embodiment of the present application (hereinafter referred to as the "pruning apparatus 100") may include a weight transformation unit 110, a pruning unit 120, and a recovery retraining unit 130.

Meanwhile, in the description of the embodiments of the present application, the pruning apparatus 100 according to the first embodiment and the neural network pruning apparatus 100' considering the structure of a graphics processing unit according to the second embodiment (hereinafter referred to as the "pruning apparatus 100'") have the following in common. When a convolution operation, for example in a convolution layer of a convolutional neural network, is performed on a graphics processing unit (hereinafter referred to as the "GPU 200") during the training process of the neural network, the operation is divided among a plurality of lower-level computing means included in the GPU 200 and processed in parallel. Both apparatuses take this characteristic into account and perform pruning using the unit into which the convolution operation is divided as the unit of pruning. The two apparatuses may be distinguished by the type of division unit of the operation (that is, the pruning unit).

In summary, the two may be distinguished in that the pruning apparatus 100 according to the first embodiment of the present application performs pruning in units of "blocks", whereas the pruning apparatus 100' according to the second embodiment performs pruning in units of "fine blocks".

For reference, the neural network in the present application may include a convolutional neural network (CNN). However, it is not limited thereto, and the neural networks to which the present application applies may include various neural networks that are already known or will be developed in the future, such as a recurrent neural network (RNN), including both trained and untrained neural networks.

Meanwhile, pruning rests on the observation that, when a neural network infers a result by multiplying its input values by learned weights, not every weight value significantly affects the inferred result. Because removing some of the weight values that make up the network therefore has little effect on the accuracy of the inference result, pruning may refer to a technique that exploits this property to reduce the size of the neural network or the amount of computation. One such approach uniformly removes all weights at or below a threshold. Because the resulting sparsity is irregular, the network becomes smaller, but no improvement in computation speed is obtained without a special computation technique or hardware. Another approach removes specific groups of weights. This yields a computational speedup, but weights that do affect the result are also easily removed, so the accuracy of the inference result drops substantially.
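The threshold-based approach and the irregular sparsity it leaves behind can be illustrated with a few lines of code. This is a generic sketch of magnitude pruning, not code from the application; the function names `magnitude_prune` and `nonzeros_per_row` and the example weights are made up for illustration.

```python
def magnitude_prune(W, threshold):
    """Zero every weight whose magnitude is at or below the threshold."""
    return [[w if abs(w) > threshold else 0.0 for w in row] for row in W]


def nonzeros_per_row(W):
    """Count surviving weights per row to expose the sparsity pattern."""
    return [sum(1 for w in row if w != 0.0) for row in W]


W = [[0.9, -0.05, 0.3],
     [0.02, -0.7, 0.1]]
sparse_W = magnitude_prune(W, threshold=0.1)
counts = nonzeros_per_row(sparse_W)
```

The surviving weights land at different positions and in different numbers per row (two in the first row, one in the second), which is why a dense GPU matrix multiplication cannot simply skip them without a sparse format or dedicated hardware.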

Hereinafter, the functions and operation of the pruning apparatus 100 according to the first embodiment of the present application are described first with reference to FIGS. 3 to 6.

FIGS. 3 and 4 are conceptual diagrams illustrating the process of GEMM-transforming an input feature map and a weight kernel of a neural network.

Referring to FIGS. 3 and 4, the weight transformation unit 110 may perform GEMM transformation on the weight kernel of the neural network. For reference, GEMM (General Matrix Multiply) transformation is a matrix transformation scheme widely used in today's deep learning inference engines. By pruning on the transformed matrix obtained through GEMM transformation, the present application resolves the problem that conventional pruning techniques, which rely on special sparse-matrix formats such as CSR (Compressed Sparse Row), CSC (Compressed Sparse Column), and COO (Coordinate list), could not be applied in combination with other kinds of pruning techniques.

In this regard, each layer of a neural network (for example, a convolution layer of a CNN) does not perform a simple multiplication between matrices. Instead, the weight kernel slides sequentially over the input image or feature map at a predetermined interval (for example, a preset stride), and a complex operation that multiplies and sums multiple values is carried out many times. The GEMM transformation described above may be used to convert a cube-shaped three-dimensional feature map or weight kernel into a two-dimensional matrix so that this complex convolution operation can be processed as an ordinary matrix multiplication.

More specifically, referring to FIG. 4, so that the convolution operation can be carried out as a general matrix multiplication, the weight transformation unit 110 may perform an Im2Col (Image to Column) transformation of the three-dimensional input feature map into a two-dimensional matrix and a one-dimensional vectorization of the weight filters of the convolutional layer. Here, the Im2Col transformation is performed on the three-dimensional input feature map, and in this process every image patch that is multiplied with the weights of the convolutional layer may be copied. After the Im2Col process, the three-dimensional input feature map is converted into a two-dimensional matrix as shown in FIG. 4. In addition, the weight transformation unit 110 may perform a one-dimensional vector transformation of the three-dimensional weight kernel of the convolutional layer.
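The Im2Col and vectorization steps described above can be checked on a small example: once every patch is a column and every filter is a row, one matrix multiplication reproduces the sliding-window result. The sketch below is a plain-Python illustration with no padding; the helper names (`im2col`, `flatten_kernels`, `matmul`, `conv_as_gemm`, `direct_conv`) and the channel-row-column flattening order are assumptions of this example, and "convolution" is taken in the deep-learning sense (no kernel flip).

```python
def im2col(x, R, S, stride=1):
    """Turn a C x H x W feature map into a (C*R*S) x (H_out*W_out) matrix."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    H_out = (H - R) // stride + 1
    W_out = (W - S) // stride + 1
    patches = [[x[c][i * stride + r][j * stride + s]
                for c in range(C) for r in range(R) for s in range(S)]
               for i in range(H_out) for j in range(W_out)]
    # Transpose so that each patch becomes one column.
    return [list(col) for col in zip(*patches)], H_out, W_out


def flatten_kernels(kernels):
    """Vectorize each C x R x S filter into one row of the weight matrix."""
    return [[w for ch in k for row in ch for w in row] for k in kernels]


def matmul(A, B):
    """Plain matrix multiplication of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def conv_as_gemm(x, kernels, stride=1):
    """Convolution computed as flatten_kernels(kernels) @ im2col(x)."""
    R, S = len(kernels[0][0]), len(kernels[0][0][0])
    cols, H_out, W_out = im2col(x, R, S, stride)
    flat = matmul(flatten_kernels(kernels), cols)
    return [[row[i * W_out:(i + 1) * W_out] for i in range(H_out)]
            for row in flat]


def direct_conv(x, kernels, stride=1):
    """Reference sliding-window convolution for comparison."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    R, S = len(kernels[0][0]), len(kernels[0][0][0])
    H_out = (H - R) // stride + 1
    W_out = (W - S) // stride + 1
    return [[[sum(k[c][r][s] * x[c][i * stride + r][j * stride + s]
                  for c in range(C) for r in range(R) for s in range(S))
              for j in range(W_out)] for i in range(H_out)] for k in kernels]


x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 channel, 3 x 3
kernels = [[[[1, 0], [0, 1]]]]            # 1 filter, 1 x 2 x 2
gemm_result = conv_as_gemm(x, kernels)
reference = direct_conv(x, kernels)
```

For this input both paths give `[[[6, 8], [12, 14]]]`. The price of the transformation is that overlapping patches are duplicated in the Im2Col matrix, which is the copying the paragraph above refers to.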

In other words, the convolution operation, which accounts for most of the computation in a neural network (in particular, a CNN), is generally carried out as a general matrix multiplication (GEMM) after a separate transformation, so that a highly optimized matrix multiplication library can be used. In a massively parallel processing system such as the GPU 200, the general matrix multiplication can be performed more efficiently by using the blocking and tiling described below.

FIG. 5 is a conceptual diagram for explaining the hierarchical structure of a graphics processing unit and the division units of a matrix multiplication operation using a GEMM-transformed weight kernel.

Referring to FIG. 5, the GPU 200 may include a plurality of compute units 211. In addition, each of the plurality of compute units 211 may include a plurality of processing elements 212.

Specifically, each processing element 212 included in the GPU 200 may be the unit in which actual computation takes place, and a compute unit 211 may be a set of multiple processing elements 212. Regarding the tasks performed on the GPU 200, a single thread executed on one processing element 212 may be referred to as a work item, and the set of work items executed in parallel on one compute unit 211 may be referred to as a work group.

To aid understanding with reference to FIG. 5, in Output (O'), which is the output of the matrix multiplication (in other words, the general matrix multiplication, GEMM) performed in any one layer of the neural network (for example, a convolution layer), the regions partitioned per work group may be the output parts computed individually by the respective compute units 211 of the GPU 200.
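The partition of the output matrix into per-work-group regions can be written down as a simple index calculation. The row-major numbering of work groups, the tile sizes, and the name `work_group_regions` are assumptions chosen for this sketch; FIG. 5 may use a different layout.

```python
def work_group_regions(M, N, tile_m, tile_n):
    """Map each work-group id to its (row_start, row_end, col_start, col_end).

    The M x N output is cut into tile_m x tile_n tiles, numbered row-major.
    End indices are exclusive; edge tiles are clipped to the matrix.
    """
    regions = {}
    gid = 0
    for r0 in range(0, M, tile_m):
        for c0 in range(0, N, tile_n):
            regions[gid] = (r0, min(r0 + tile_m, M), c0, min(c0 + tile_n, N))
            gid += 1
    return regions


# A 4 x 6 output split into 2 x 3 tiles gives four work groups.
regions = work_group_regions(4, 6, 2, 3)
covered = sum((r1 - r0) * (c1 - c0) for r0, r1, c0, c1 in regions.values())
```

Every output element belongs to exactly one region, so the work groups can be computed independently and in parallel.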

In addition, regarding the hierarchical memory structure of the GPU 200 and referring to FIG. 5, the GPU 200 may include a global memory 221, a plurality of local memories 222, and a plurality of private memories 223.

Specifically, the GPU 200 has a multi-level memory hierarchy as shown in FIG. 5. The global memory 221 may refer to the off-chip DRAM memory region inside the GPU 200 and may occupy the largest portion of the internal memory of the GPU 200.

In addition, a local memory 222 may refer to an on-chip SRAM memory region provided for each of the compute units 211, and may be a shared memory region used jointly by the processing elements 212 within one compute unit 211.

In addition, a private memory 223 is a register memory region that exists for each processing element 212, and may not be shared with other processing elements 212.

With respect to this hierarchical memory structure, when an external host launches a given GPU kernel (a set of instructions executed through the GPU 200), the data required by the GPU kernel may be copied from the external host memory to the global memory 221 inside the GPU 200, and the data required for the operation of each processing element 212 may then be copied from the global memory 221 as the instructions execute. In addition, memory capacity decreases and memory access speed increases in the order of the global memory 221, the local memory 222, and the private memory 223.

Meanwhile, the GEMM operation performed on the GPU 200 is carried out in parallel using the processing elements 212 inside the GPU 200: each work item is responsible for its share of the output feature map region and performs the GEMM operation on the region allocated to it. When the GPU kernel is executed, the GEMM-transformed two-dimensional input feature map and the convolution weight matrix are both copied to the global memory 221. For the input feature map and weight kernel located in the global memory 221, the process of computing an output feature map region through the outer-product operation corresponding to a single work item may be performed in parallel on the processing elements 212 for all work items, and once the outer-product operations for all work groups have been performed, the GEMM operation for the entire output feature map is complete.

FIG. 6 is a conceptual diagram illustrating the blocking/tiling process associated with a matrix multiplication operation using a GEMM-transformed weight kernel.

Referring to FIG. 6, the outer-product operations performed in parallel through the processing elements 212 of the GPU 200 are carried out after the data has been partially copied into the local memory 222 and the private memory 223 through a blocking and tiling process. Each block produced by this process is copied to the local memory 222, and the copied block may in turn be divided into several tiles (micro blocks) and copied to the private memory 223. The outer-product operation on the blocks of the input feature map and weight kernel copied to the local memory 222 is divided among work groups and performed on the respective compute units 211 inside the GPU 200, while the operation on the input feature map and weight kernel copied to the private memory 223 may be performed by work items on the respective processing elements 212 inside the GPU 200.
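The blocking step can be pictured with a small host-side sketch. The following is only an illustration in plain Python, not the GPU kernel of the present application: the function name `blocked_matmul` and the `block` parameter are assumptions introduced here, each output block stands in for the region handled by one work group, and each step along the shared dimension stands in for one block copied into local memory.

```python
def blocked_matmul(a, b, block=2):
    """Multiply a (M x K) by b (K x N), visiting the operands block by block."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    # Each (i0, j0) output block stands in for the region of one work group.
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            # Walk the shared dimension one block at a time; on a GPU each step
            # would copy one block of a and one block of b into local memory.
            for k0 in range(0, k, block):
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, n)):
                        acc = 0.0
                        for kk in range(k0, min(k0 + block, k)):
                            acc += a[i][kk] * b[kk][j]
                        out[i][j] += acc
    return out
```

Because the partial products are accumulated per block along the shared dimension, skipping one of the `k0` steps is equivalent to treating that block of weights as zero, which is the property block-unit pruning relies on.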

The pruning unit 120 may prune the GEMM-transformed weight kernel in units of a 'block', which is the division unit of the matrix multiplication operation (GEMM operation) performed in parallel through the compute units 211 of the GPU 200. Here, the 'block' that serves as the pruning unit in the first embodiment of the present application may correspond to a partitioned portion of the GEMM-transformed weight kernel that is copied from the global memory 221 to the local memory 222 of each compute unit 211 in order to perform the GEMM-based matrix multiplication operation.

According to the first embodiment of the present application, when each of the compute units 211 of the GPU 200 performs the matrix multiplication operation on a preset plurality of blocks, the pruning unit 120 may remove a preset ratio of the blocks associated with each compute unit 211, at the same ratio for every compute unit 211.

In other words, the pruning unit 120 may remove the same number of blocks from each work group so that the work groups corresponding to the respective compute units 211 keep an equal computational workload after pruning. That is, the pruning ratio p may be applied to the number of blocks that each compute unit 211 processes through its assigned work group, so that the same number of blocks remains for every compute unit 211 after pruning.

In this regard, because the pruning unit 120 prunes in units of blocks according to the first embodiment of the present application, a block removed by pruning need not be copied from the global memory 221 to the local memory 222 during inference of the neural network. In other words, according to the first embodiment, the pruning unit is matched to the hierarchical structure of the GPU 200 and to the unit in which data (the divided/partitioned weights) is copied when parallel operations are performed through the GPU 200. The memory-to-memory copy of a removed block can therefore be skipped altogether, so that pruning yields a real improvement in computation speed.

Specifically, according to the first embodiment of the present application, the pruning unit 120 may compute, for each of the compute units 211, a weight importance for each of the blocks allocated to that compute unit 211.

Specifically, the pruning unit 120 may compute the weight importance of each block as an l2-norm value based on Equation 1 below.

[Equation 1]

Here, l2_k is the l2-norm value that serves as the weight importance of each block. Based on the computed weight importance, the pruning unit 120 may determine, for each of the compute units 211, the blocks to be removed at the preset ratio p, and may perform pruning so that only the top-ranked blocks with the highest l2-norm values remain. That is, all values in the lower-ranked weight blocks that are not among the retained top-ranked blocks may be replaced with 0.
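The selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the pruning unit 120: the l2-norm of a block is taken to be the square root of the sum of its squared weights (the usual definition; the equation itself is not reproduced in this text), the number of removed blocks is rounded down, and the names `block_l2` and `prune_blocks` and the nested-list data layout are hypothetical.

```python
import math

def block_l2(block):
    """Assumed importance score: sqrt of the sum of squared weights in a block."""
    return math.sqrt(sum(w * w for row in block for w in row))

def prune_blocks(work_groups, p):
    """Zero the lowest-norm blocks, removing the same count in every work group.

    work_groups: list of work groups; each is a list of blocks (2D lists).
    p: fraction of blocks to remove per work group (rounded down here).
    """
    pruned = []
    for blocks in work_groups:
        n_remove = int(len(blocks) * p)
        # Rank block indices from least to most important.
        order = sorted(range(len(blocks)), key=lambda k: block_l2(blocks[k]))
        removed = set(order[:n_remove])
        group = []
        for k, blk in enumerate(blocks):
            if k in removed:
                group.append([[0.0] * len(row) for row in blk])
            else:
                group.append([row[:] for row in blk])
        pruned.append(group)
    return pruned
```

Ranking within each work group, rather than across the whole layer, is what keeps the number of surviving blocks identical for every compute unit.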

In summary, the pruning apparatus 100 according to the first embodiment of the present application may perform block-unit pruning, in which the weights to be removed are selected using, as the unit of pruning, the 'block' produced by the blocking/tiling process used to perform the GEMM-based matrix multiplication operation.

Meanwhile, a weight region removed by block-unit pruning according to the first embodiment of the present application may be identified inside the GPU kernel through simple block location metadata, so that both the copy to local memory and the computation for that region are skipped. In this regard, the pruning unit 120 may replace all weight values in the regions removed by block-unit pruning with 0 and generate block location metadata describing the removed regions. The block location metadata holds the positions of the blocks that were not removed. To generate it, the pruning unit 120 may first generate pruning metadata in the form of a binary mask that has the value 0 for each removed block, and then count the number of 0s between the blocks that were not removed. When this block location metadata is passed to the GPU kernel, the GPU 200 can readily skip the pruned weights during inference and determine the positions at which computation is actually to be performed.
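One possible reading of this metadata scheme is sketched below. The exact layout of the block location metadata is not specified in this passage, so the representation (one gap count per retained block) and the function names `block_location_metadata` and `kept_positions` are assumptions made for illustration only.

```python
def block_location_metadata(mask):
    """For each retained block (1), count the removed blocks (0) that precede it
    since the previous retained block."""
    gaps, zeros = [], 0
    for bit in mask:
        if bit == 0:
            zeros += 1
        else:
            gaps.append(zeros)
            zeros = 0
    return gaps

def kept_positions(gaps):
    """Recover the indices of the blocks to visit from the gap counts."""
    positions, cursor = [], -1
    for gap in gaps:
        cursor += gap + 1  # skip `gap` removed blocks, then land on a kept one
        positions.append(cursor)
    return positions
```

For the mask `[0, 1, 1, 0, 0, 1, 0]` the gap counts are `[1, 0, 2]`, and walking them recovers block indices 1, 2, and 5, so a kernel holding only the gap counts never needs to touch the removed blocks.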

The recovery retraining unit 130 may restore at least some of the inter-node weight kernels in the pruned neural network and retrain the restored inter-node weight kernels.

FIG. 7 is a schematic configuration diagram of a neural network pruning apparatus considering the structure of a graphic processing unit according to a second embodiment of the present application.

Referring to FIG. 7, the pruning apparatus 100' may include a weight conversion unit 110', a pruning unit 120', and a recovery retraining unit 130'. The weight conversion unit 110' and the recovery retraining unit 130' may be understood to perform the same functions as the weight conversion unit 110 and the recovery retraining unit 130 of the pruning apparatus 100 according to the first embodiment described above, so the following description focuses on the function of the pruning unit 120' of the pruning apparatus 100'.

The pruning unit 120' may prune the weight kernel GEMM-transformed by the weight conversion unit 110' in units of a 'micro block', which is the division unit of the matrix multiplication operation performed in parallel through the processing elements 212. Here, the 'micro block' that serves as the pruning unit in the second embodiment of the present application may correspond to each weight row contained in a block, the block being a partitioned portion of the GEMM-transformed weight kernel that is copied from the global memory 221 to the local memory 222 of each compute unit 211 in order to perform the GEMM-based matrix multiplication operation.

That is, compared with the pruning apparatus 100 according to the first embodiment, the pruning apparatus 100' according to the second embodiment sets the pruning unit to the smaller micro block and can thus apply a pruning technique that selects the weights to be removed at a finer granularity.

According to the second embodiment of the present application, the pruning unit 120' may remove, for each individual processing element 212, a preset ratio of the micro blocks associated with that processing element 212. Because the pruning unit 120' removes micro blocks at an equal ratio from the blocks allocated to each processing element 212, the computational load of each processing element 212 in the inference stage is equal.

In this regard, the work group handled by each compute unit 211 includes a plurality of consecutive blocks, and each compute unit 211 may process the consecutive blocks iteratively, one at a time. In addition, each block may include a plurality of micro blocks. More specifically, when each work group includes a given number of blocks and each block consists of a given number of micro blocks, the number of pruning candidates for a single work group may be understood to be the product of those two numbers.

In addition, the same pruning ratio is applied to all work groups so that an even computational load (overhead) is maintained across the processing elements 212. Accordingly, when the preset pruning removal ratio is p, a fraction p of the micro blocks of each work group may be removed.

FIG. 8 is a conceptual diagram for explaining the micro-block-unit pruning technique performed by the neural network pruning apparatus considering the structure of a graphic processing unit according to the second embodiment of the present application.

Referring to FIG. 8, just as each work group has the same computational load, the pruning unit 120' must keep the number of micro blocks removed from the copied region corresponding to each individual work item equal across the processing elements 212, so that the work items in a work group are also assigned to the processing elements 212 with equal computational load.

To this end, the pruning unit 120' may group (rearrange) the micro blocks processed for a single work group according to the processing element 212 that processes each micro block, and may then determine the micro blocks to be removed group by group so that the same number of micro blocks is removed from each of the resulting groups.

More specifically, referring to the [Block] structure at the upper right of FIG. 8, each of the micro blocks making up one block may be processed as one work item on an individual processing element 212. For example, the micro block corresponding to the first row of the block shown at the upper right of FIG. 8 corresponds to the work item processed by the first processing element 212, the micro block corresponding to the second row corresponds to the work item processed by the second processing element 212, and so on, up to the micro block corresponding to the last row of the block, which corresponds to the work item processed by the last processing element 212.

In addition, referring to the [Load balancing in micro-block pruning] part at the lower right of FIG. 8, the pruning unit 120' may group the rows located at the same position in each of the blocks into one group, rearranging the micro blocks into one group per row position, and may remove micro blocks at the same ratio from the micro blocks in each group, so that the number of micro blocks processed by each processing element 212 remains equal after pruning.
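The grouping by row position can be sketched as below. This is an illustrative reading only: it assumes that row r of every block in a work group is handled by processing element r, that importance is the l2-norm of the row, and that the removal count is rounded down; the function name `prune_micro_blocks` and the nested-list layout are hypothetical.

```python
import math

def prune_micro_blocks(blocks, p):
    """Zero the lowest-norm rows so every row position loses the same count.

    blocks: the blocks of one work group; each block is a list of rows
            (micro blocks), with row r assumed to go to processing element r.
    p: fraction of micro blocks to remove per row position (rounded down).
    """
    n_blocks, n_rows = len(blocks), len(blocks[0])
    n_remove = int(n_blocks * p)  # identical for every processing element
    out = [[row[:] for row in blk] for blk in blocks]
    for r in range(n_rows):
        # Group the micro blocks that sit at row position r in every block.
        norms = [(math.sqrt(sum(w * w for w in blocks[b][r])), b)
                 for b in range(n_blocks)]
        for _, b in sorted(norms)[:n_remove]:
            out[b][r] = [0.0] * len(out[b][r])
    return out
```

With four blocks of two rows each and p = 0.5, every row position keeps exactly two micro blocks, even though the surviving rows come from different blocks for different processing elements.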

According to the second embodiment of the present application, the pruning unit 120' may compute a weight importance for each of the micro blocks associated with each of the processing elements 212.

In this regard, the pruning unit 120' may compute the weight importance of each micro block as an l2-norm value based on Equation 2 below.

[Equation 2]

Here, l2_{i,j} is the l2-norm value that serves as the weight importance of each micro block. Based on the computed weight importance, the pruning unit 120' may determine, for each of the processing elements 212, the micro blocks to be removed at the preset ratio p, and may perform micro-block-unit pruning that retains micro blocks in descending order of l2-norm value. That is, all values in the removed lower-ranked micro blocks may be replaced with 0.

As with block-unit pruning, micro-block-unit pruning by the pruning unit 120' can use metadata to skip both the copy to local memory and the GEMM computation. To generate this metadata, the pruning unit 120' may generate pruning metadata in the form of a binary mask that has the value 0 for each micro block removed from the pruned and retrained weight region, and may then perform integer encoding to reduce the amount of metadata. For example, when the pruning metadata (0100 1000 0000 0001)₂ is encoded using an 8-bit unsigned character data type, it can be represented by the two values (72, 1). As another example, the pruning unit 120' may use a 32-bit integer type to encode 32 pruning metadata bits into a single integer. The encoded metadata is decoded inside the GPU kernel, and the decoded metadata may then be used, through an operation that counts the number of 0s inside the GPU kernel, to compute the next access position in the global memory 221 for each work item.
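The integer encoding step can be illustrated with a short sketch. The bit order is not stated in the text, so most-significant-bit-first packing is assumed here, and the function names `encode_mask` and `decode_mask` are hypothetical.

```python
def encode_mask(bits, width=8):
    """Pack a 0/1 pruning mask into unsigned integers of `width` bits, MSB first."""
    if len(bits) % width != 0:
        raise ValueError("mask length must be a multiple of width")
    words = []
    for start in range(0, len(bits), width):
        value = 0
        for bit in bits[start:start + width]:
            value = (value << 1) | bit
        words.append(value)
    return words

def decode_mask(words, width=8):
    """Unpack the integers back into the 0/1 mask (inverse of encode_mask)."""
    return [(w >> shift) & 1
            for w in words
            for shift in range(width - 1, -1, -1)]
```

Under this assumption the 16-bit mask 0100 1000 0000 0001 packs into the two 8-bit values 72 and 1, and decoding those two values restores the original mask bit for bit. A wider word (for example `width=32`) reduces the number of metadata entries further at the cost of a longer decode loop per entry.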

The recovery retraining unit 130' may restore at least some of the inter-node weight kernels in the pruned neural network and retrain the restored inter-node weight kernels.

In summary, the pruning apparatus 100 according to the first embodiment and the pruning apparatus 100' according to the second embodiment both remove weights at an equal level (ratio) from the many lower-level computation modules (compute units 211, processing elements 212, and so on) in the hierarchy of the GPU 200 during inference of a neural network running on the GPU 200. Because pruning is performed so that the amount of computation each lower-level module must handle in parallel is evenly distributed, a real gain in computation speed can be obtained.

That is, compared with a central processing unit (CPU), the GPU 200 has many lower-level computation modules arranged in parallel and can therefore process parallel operations quickly, but it has fewer control units than a CPU, so that an individual operation takes longer than on a CPU. Given that the GPU is structurally optimized for parallel computation in this way, the pruning technique disclosed herein divides the operations performed during inference of the neural network (convolution operations, or the GEMM operations resulting from GEMM transformation) into a parallel form so that they are allocated to the individual lower-level computation modules, and sets the unit of this division/allocation as the unit of pruning. A speed gain tailored to the parallel characteristics of the GPU 200 can thereby be obtained.

Meanwhile, although the pruning technique according to the first embodiment (block-unit pruning) and the pruning technique according to the second embodiment (micro-block-unit pruning) have been described separately above, depending on the implementation a staged pruning process may of course be used, in which block-unit pruning according to the first embodiment is performed first and micro-block-unit pruning according to the second embodiment is then performed on the weight kernel that was not removed.

FIGS. 9A to 9C are diagrams comparing pruning results obtained with conventional pruning techniques and with the pruning techniques disclosed herein.

In the experimental example shown in FIGS. 9A to 9C, block-unit pruning according to the first embodiment and micro-block-unit pruning according to the second embodiment were applied, by way of example, to a VGG16 neural network trained on CIFAR100 data and compared with conventional pruning techniques.

Specifically, FIG. 9A plots, as two-dimensional arrays, the weight values of the third convolutional layer of the neural network after each pruning technique is applied at a 50% pruning ratio (p=0.5). Here, (a) of FIG. 9A shows the case where conventional kernel-channel pruning is applied, (b) of FIG. 9A shows the case where block-unit pruning according to the first embodiment is applied, (c) of FIG. 9A shows the case where micro-block-unit pruning according to the second embodiment is applied, and (d) of FIG. 9A shows the case where conventional weight pruning is applied.

The black regions in FIG. 9A are regions where weights inside the convolutional neural network have been removed, and they have the value 0. The third convolutional layer has 128 filters of size 64×3×3, drawn so that each filter is arranged in a single column. In the case of kernel-channel pruning, 64 of the 128 filters have been removed. The removed filters (Nos. 64 to 128) do not actually remain in the network as zero-valued weights, but they are drawn with the value 0 so that they can be compared in the same shape as the weights of the other pruning techniques.

FIG. 9B is a graph comparing the accuracy of the pruning techniques. The horizontal axis is the pruning ratio (p) applied to each convolutional layer, and the vertical axis is the accuracy of the neural network after the pruning and retraining process. FIG. 9C is a graph comparing the inference speed of the pruning techniques. The horizontal axis is the number of frames processed per second, an indicator of inference speed, and the vertical axis is the accuracy of the neural network. Referring to FIG. 9C, speed was measured at every pruning ratio for the block-unit pruning (Block pruning) and micro-block-unit pruning (Micro tile pruning) disclosed herein, whereas for weight pruning only the speed of the sparsest network (p=0.9) was measured.

Referring to FIGS. 9B and 9C, it was verified that, at the same pruning ratio, both pruning techniques disclosed herein (block-unit pruning and micro-block-unit pruning) show less accuracy degradation than kernel-channel pruning, a structured pruning technique, and process more images per second. In addition, both techniques show greater accuracy degradation than weight pruning, an unstructured pruning technique. However, unlike weight pruning, in which the removed pruning candidates yield no increase in computation speed at all, the two techniques disclosed herein show an increase in computation speed over the original neural network even at a low pruning ratio (p=0.25).

Accordingly, by applying the two pruning techniques disclosed herein to a system that uses the GPU 200, the irregular sparsity that is produced can be exploited efficiently while the high accuracy of the original neural network is maintained, so that inference latency, as well as inference throughput in batch-processing applications, can be significantly improved.

Hereinafter, the operation flow of the present application is briefly reviewed based on the details described above.

FIG. 10 is an operation flowchart of a neural network pruning method considering the structure of a graphic processing unit according to the first embodiment of the present application.

The neural network pruning method considering the structure of a graphic processing unit according to the first embodiment of the present application, shown in FIG. 10, may be performed by the pruning apparatus 100 described above. Therefore, even where omitted below, the description given for the pruning apparatus 100 applies equally to the description of the neural network pruning method according to the first embodiment of the present application.

Referring to FIG. 10, in step S11, the weight conversion unit 110 may perform a GEMM transformation of the weight kernel of a neural network that operates through a graphic processing unit (GPU) 200 including a plurality of compute units 211.

Next, in step S12, the pruning unit 120 may prune the weight kernel that was GEMM-transformed in step S11 in units of blocks, where a block is the division unit of the matrix multiplication operation performed in parallel through the plurality of compute units 211.
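
Steps S11 and S12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the GEMM transformation is assumed to flatten a kernel indexed as `kernel[k][c][r][s]` into a matrix with one row per output filter, block importance is assumed to be the L1 norm, and each block row is assumed to be handled by one compute unit. The names `gemm_transform`, `block_importance`, and `prune_blocks` are hypothetical.

```python
def gemm_transform(kernel):
    # S11 (assumed layout): kernel[k][c][r][s] becomes a matrix with one row
    # per output filter k and C*R*S columns.
    return [[w for channel in filt for row in channel for w in row] for filt in kernel]

def block_importance(mat, r0, c0, bs):
    # L1 norm of the bs x bs block whose top-left corner is (r0, c0);
    # the metric is an assumption for this sketch.
    return sum(abs(mat[i][j]) for i in range(r0, r0 + bs) for j in range(c0, c0 + bs))

def prune_blocks(mat, bs, ratio):
    # S12: for each block row (assumed to map to one compute unit), zero the
    # fraction `ratio` of blocks with the lowest importance.
    rows, cols = len(mat), len(mat[0])
    pruned = [row[:] for row in mat]
    removed = []
    for r0 in range(0, rows, bs):
        starts = list(range(0, cols, bs))
        starts.sort(key=lambda c0: block_importance(mat, r0, c0, bs))
        for c0 in starts[: int(ratio * len(starts))]:
            removed.append((r0, c0))
            for i in range(r0, r0 + bs):
                for j in range(c0, c0 + bs):
                    pruned[i][j] = 0.0
    return pruned, removed

# Toy kernel: K=4 filters, C=2 channels, 2x2 spatial size -> 4 x 8 GEMM matrix.
kernel = [[[[float((k + 1) * (c + 1) * (r * 2 + s + 1)) for s in range(2)]
            for r in range(2)] for c in range(2)] for k in range(4)]
W = gemm_transform(kernel)
W_pruned, removed_blocks = prune_blocks(W, bs=2, ratio=0.5)
```

Selecting the blocks to remove separately within each group keeps the number of surviving blocks the same for every group, which is one plausible way to keep the work of the compute units balanced.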

In the above description, steps S11 and S12 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.

FIG. 11 is an operation flowchart of a neural network pruning method considering the structure of a graphic processing unit according to the second embodiment of the present application.

The neural network pruning method considering the structure of a graphic processing unit according to the second embodiment of the present application, shown in FIG. 11, may be performed by the pruning apparatus 100' described above. Therefore, even where omitted below, the description given for the pruning apparatus 100' applies equally to the description of the neural network pruning method according to the second embodiment of the present application.

Referring to FIG. 11, in step S21, the weight conversion unit 110' may perform a GEMM transformation of the weight kernel of a neural network that operates through a graphic processing unit (GPU) 200 including a plurality of compute units 211, each of which includes a plurality of processing elements (PEs) 212.

Next, in step S22, the pruning unit 120' may prune the weight kernel that was GEMM-transformed in step S21 in units of fine blocks, where a fine block is the division unit of the matrix multiplication operation performed in parallel through the plurality of processing elements 212.
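
Step S22 can be sketched in the same spirit. According to the claims, a fine block corresponds to a weight row inside a block. The sketch below assumes, for illustration only, that all fine blocks lying in the same matrix row belong to the same processing element and that importance is the L1 norm. The function name `prune_fine_blocks` and the toy matrix are hypothetical.

```python
def prune_fine_blocks(mat, bs, ratio):
    # Each row of `mat` is split into 1 x bs fine blocks; all fine blocks in a
    # row are assumed to belong to the same processing element (PE).
    pruned = [row[:] for row in mat]
    kept_per_row = []
    for i, row in enumerate(mat):
        starts = list(range(0, len(row), bs))
        # Rank this PE's fine blocks by L1 norm (assumed importance metric).
        starts.sort(key=lambda c0: sum(abs(x) for x in row[c0:c0 + bs]))
        n_remove = int(ratio * len(starts))
        for c0 in starts[:n_remove]:
            for j in range(c0, c0 + bs):
                pruned[i][j] = 0.0
        kept_per_row.append(len(starts) - n_remove)
    return pruned, kept_per_row

W = [
    [0.1, 0.2, 3.0, 4.0, 0.5, 0.5, 2.0, 2.0],
    [5.0, 5.0, 0.1, 0.1, 1.0, 1.0, 0.2, 0.2],
    [0.3, 0.3, 0.4, 0.4, 6.0, 6.0, 7.0, 7.0],
    [2.0, 2.0, 1.0, 1.0, 0.1, 0.1, 0.2, 0.2],
]
W_fine, kept = prune_fine_blocks(W, bs=2, ratio=0.5)
```

Each row keeps the same number of fine blocks, so the remaining work per processing element is equal in this sketch, while the positions of the surviving fine blocks differ from row to row and therefore form an irregular sparsity pattern.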

In the above description, steps S21 and S22 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.

The neural network pruning method considering the structure of a graphic processing unit according to an embodiment of the present application may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine language code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

In addition, the above-described neural network pruning method considering the structure of a graphic processing unit may also be implemented in the form of a computer program or application that is stored on a recording medium and executed by a computer.

The above description of the present application is for illustration, and those of ordinary skill in the art to which the present application pertains will understand that it can easily be modified into other specific forms without changing the technical spirit or essential features of the present application. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present application is indicated by the claims below rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

100, 100': neural network pruning apparatus considering the structure of a graphic processing unit
110, 110': weight conversion unit
120, 120': pruning unit
130, 130': recovery retraining unit
200: graphic processing unit (GPU)
211: compute unit
212: processing element (PE)
221: global memory
222: local memory
223: private memory

Claims

1. A neural network pruning method considering the structure of a graphic processing unit, the method being performed by a neural network pruning apparatus and comprising:
performing a GEMM transformation of a weight kernel of a neural network operating through a graphic processing unit (GPU) including a plurality of compute units; and
pruning the GEMM-transformed weight kernel in units of blocks, each block being a division unit of a matrix multiplication operation performed in parallel through the plurality of compute units,
wherein the graphic processing unit includes
a global memory and a plurality of local memories provided for each of the plurality of compute units, and
wherein the block
corresponds to a partitioned portion of the GEMM-transformed weight kernel that is copied from the global memory to the local memory to perform the matrix multiplication operation.

2. (Deleted)

3. The method of claim 1,
wherein each of the plurality of compute units performs the matrix multiplication operation for a preset plurality of blocks, and
wherein the pruning step
removes a preset ratio of the blocks from among the plurality of blocks associated with each of the plurality of compute units.

4. The method of claim 1,
wherein a block removed by the pruning is not copied from the global memory to the local memory during inference of the neural network.

5. The method of claim 3,
wherein the pruning step comprises:
calculating a weight importance for each of the blocks; and
determining the blocks to be removed at the preset ratio based on the calculated weight importance.

6. A neural network pruning method considering the structure of a graphic processing unit, the method being performed by a neural network pruning apparatus and comprising:
performing a GEMM transformation of a weight kernel of a neural network operating through a graphic processing unit (GPU) including a plurality of compute units, each of which includes a plurality of processing elements (PEs); and
pruning the GEMM-transformed weight kernel in units of fine blocks, each fine block being a division unit of a matrix multiplication operation performed in parallel through the plurality of processing elements,
wherein the graphic processing unit includes
a global memory, a plurality of local memories provided for each of the plurality of compute units, and a plurality of private memories provided for each of the plurality of processing elements, and
wherein the fine block
corresponds to a weight row included in a block, the block being a partitioned portion of the GEMM-transformed weight kernel that is copied from the global memory to the local memory to perform the matrix multiplication operation.

7. (Deleted)

8. The method of claim 6,
wherein the pruning step
removes a preset ratio of the fine blocks from among the plurality of fine blocks associated with each of the plurality of processing elements.

9. The method of claim 8,
wherein, through the pruning step,
the computational load of each of the plurality of processing elements is made equal.

10. The method of claim 8,
wherein the pruning step comprises:
calculating a weight importance for each of the fine blocks; and
determining the fine blocks to be removed at the preset ratio based on the calculated weight importance.

11. A neural network pruning apparatus considering the structure of a graphic processing unit, the apparatus comprising:
a weight conversion unit that performs a GEMM transformation of a weight kernel of a neural network operating through a graphic processing unit (GPU) including a plurality of compute units; and
a pruning unit that prunes the GEMM-transformed weight kernel in units of blocks, each block being a division unit of a matrix multiplication operation performed in parallel through the plurality of compute units,
wherein the graphic processing unit includes
a global memory and a plurality of local memories provided for each of the plurality of compute units, and
wherein the block
corresponds to a partitioned portion of the GEMM-transformed weight kernel that is copied from the global memory to the local memory to perform the matrix multiplication operation.

12. (Deleted)

13. The apparatus of claim 11,
wherein the pruning unit
calculates a weight importance for each of the blocks and determines the blocks to be removed at a preset ratio based on the calculated weight importance.

14. A neural network pruning apparatus considering the structure of a graphic processing unit, the apparatus comprising:
a weight conversion unit that performs a GEMM transformation of a weight kernel of a neural network operating through a graphic processing unit (GPU) including a plurality of compute units, each of which includes a plurality of processing elements (PEs); and
a pruning unit that prunes the GEMM-transformed weight kernel in units of fine blocks, each fine block being a division unit of a matrix multiplication operation performed in parallel through the plurality of processing elements,
wherein the graphic processing unit includes
a global memory, a plurality of local memories provided for each of the plurality of compute units, and a plurality of private memories provided for each of the plurality of processing elements, and
wherein the fine block
corresponds to a weight row included in a block, the block being a partitioned portion of the GEMM-transformed weight kernel that is copied from the global memory to the local memory to perform the matrix multiplication operation.

15. (Deleted)

16. The apparatus of claim 14,
wherein the pruning unit
equalizes the computational load of each of the plurality of processing elements by removing a preset ratio of the fine blocks from among the plurality of fine blocks associated with each of the plurality of processing elements.

17. The apparatus of claim 16,
wherein the pruning unit
calculates a weight importance for each of the fine blocks and determines the fine blocks to be removed at the preset ratio based on the calculated weight importance.