KR20200115239A - Apparatus and method for compressing trained deep neural networks

Info

Publication number: KR20200115239A
Application number: KR1020200035403A
Authority: KR
Inventors: 문현철; 김재곤; 천승문; 고현철
Original assignee: (주)인시그널; 한국항공대학교산학협력단
Priority date: 2019-03-26
Filing date: 2020-03-24
Publication date: 2020-10-07
Also published as: KR20210135465A

Abstract

Disclosed are a device and a method for compressing a trained deep neural network. According to an embodiment of the present invention, the device comprises: a parameter reduction unit for simplifying a weight matrix describing a trained deep neural network for processing multimedia content; a quantization unit for quantizing the weight matrix reduced by the parameter reduction unit; and an entropy coding unit for scanning the weight matrix quantized by the quantization unit in a predetermined direction, and then sequentially performing entropy-coding for scanned weights to output a bitstream. The parameter reduction unit uses at least one of a pruning method and a low rank approximation method. According to the present invention, a trained deep neural network can be effectively compressed.

Description

Apparatus and method for compressing trained deep neural networks

The present invention relates to artificial neural network (ANN) technology and, more particularly, to an apparatus and method for compressing a trained deep neural network (DNN) used to process multimedia content.

Attempts to apply artificial intelligence (AI) across various industries have continued. In particular, recent AI technology has improved greatly in performance by using neural networks (NNs), which are information processing systems that share certain performance characteristics with biological neural networks, and its fields of application are growing rapidly as a result.

Such a neural network is also called an 'artificial' neural network (ANN). An ANN is a distributed, parallel information processing model that mimics the behavioral characteristics of animal nervous systems. An ANN contains a large number of interconnected nodes (also called neurons) and has two characteristics: 1) each neuron processes the weighted inputs from other adjacent neurons through a specific output function (also called an activation function); 2) the strength of information transmission between neurons is measured by a so-called "weight", and these weights can be adjusted by the self-learning of a specific algorithm.

ANNs can use different architectures to specify the variables a network contains and their topological relationships. The parameters of a neural network may be the weights of the connections between neurons as well as the activities of the neurons. Types of neural network topology include the feed-forward network and the backward propagation neural network. In the former, the nodes in each layer, connected to one another within the same layer, feed the next stage, and the network includes some form of 'learning rule' that modifies the connection weights according to the input patterns presented. The latter allows errors to be propagated backward to adjust the weights and is more advanced than the former.

A deep neural network (DNN) is a neural network with multiple levels of interconnected nodes, which allows it to represent highly nonlinear and highly varying functions compactly. However, the computational complexity of a DNN rises sharply with the number of nodes associated with its many layers. Efficient computational methods for training such DNNs have been developed in recent years. As training speed has increased dramatically, DNNs have been applied successfully to diverse and complex tasks such as speech recognition, image segmentation, object detection, and facial recognition.

The compression and decompression of multimedia content, for example video, is one of the fields in which the application of DNNs is being attempted. High Efficiency Video Coding (HEVC) was developed as a next-generation video coding scheme by a joint video project of the ITU-T Video Coding Experts Group and ISO/IEC Moving Picture Experts Group (MPEG) standardization organizations, and it has been adopted and used as an international standard. It is known that performance can be improved further by applying DNNs to new video coding standards such as HEVC. One such attempt is disclosed in Korean Patent Laid-Open Publication No. 10-2018-0052651, "Method and apparatus of neural network-based processing in video coding."

However, the scale of neural networks has exploded in recent years owing to rapid development. Some advanced neural network models have hundreds of layers and billions of connections, and implementing them is both compute-intensive and memory-intensive.

Because neural networks keep growing, making the model small, that is, compressing it, is very important for deployment on devices constrained in storage capacity, processor performance, communication bandwidth, and the like, such as mobile terminals. In particular, a small neural network model is essential for applying neural networks to video coding applications for producing and consuming multimedia content, which are important applications on mobile terminals. To date, however, techniques for compressing a neural network model and restoring the compressed model have not been studied sufficiently.

Korean Patent Laid-Open Publication No. 10-2018-0052651, "Method and apparatus of neural network-based processing in video coding"

One problem to be solved by the present invention is to provide an apparatus and method for compressing a trained deep neural network used for the description, analysis, and processing of multimedia content, such that the network can be applied to devices constrained in storage, processor performance, communication bandwidth, and the like, such as mobile terminals, while minimizing degradation of the network's performance.

A deep neural network compression apparatus according to an embodiment of the present invention for solving the above problem comprises: a parameter reduction unit for simplifying a weight matrix describing a deep neural network; a quantization unit for quantizing the weight matrix reduced by the parameter reduction unit; and an entropy coding unit for scanning the weight matrix quantized by the quantization unit in a predetermined direction and then sequentially entropy-coding the scanned weights to output a bitstream. The parameter reduction unit simplifies the weight matrix using at least one of a pruning technique and a low-rank approximation technique.

According to one aspect of the embodiment, the parameter reduction unit outputs information indicating the technique used to simplify the weight matrix.

According to another aspect of the embodiment, the low-rank approximation technique represents the original weight matrix of the trained deep neural network by decomposing it into a plurality of lower-dimensional low-rank weight matrices. In this case, the parameter reduction unit may output one or more of: the circulant operator used in the low-rank approximation technique, the rank value, the dimension and shape information of the original weight matrix, and the reshaping mode.
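As a rough illustration of the low-rank idea described above, the following sketch factors a weight matrix into two smaller matrices using a truncated singular value decomposition. The patent text does not state which decomposition is used, so the choice of SVD, the function name `low_rank_factors`, and the matrix sizes are all assumptions made for illustration only.

```python
import numpy as np

def low_rank_factors(weight, rank):
    """Factor `weight` (m x n) into A (m x rank) and B (rank x n) by truncated SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # scale the kept left singular vectors
    b = vt[:rank, :]             # kept right singular vectors
    return a, b

rng = np.random.default_rng(0)
# Hypothetical 64 x 48 weight matrix built to have rank 4 exactly,
# so a rank-4 factorization reproduces it up to rounding error.
w = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))
a, b = low_rank_factors(w, rank=4)

original_params = w.size             # 64 * 48
factored_params = a.size + b.size    # 64 * 4 + 4 * 48
max_error = float(np.max(np.abs(w - a @ b)))
```

In this sketch the rank value and the original shape `(64, 48)` are exactly the kind of side information a decoder would need to rebuild the matrix from the two factors. A real trained weight matrix is rarely exactly low rank, so the reconstruction would then be approximate rather than exact.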

A method for compressing a deep neural network according to another embodiment of the present invention for achieving the above technical problem comprises: a parameter reduction step of simplifying a weight matrix describing a trained deep neural network; a quantization step of quantizing the weight matrix reduced in the parameter reduction step; and an entropy coding step of scanning the weight matrix quantized in the quantization step in a predetermined direction and sequentially entropy-coding the scanned weights to output a bitstream. The parameter reduction step uses at least one of a pruning technique and a low-rank approximation technique.

According to an embodiment of the present invention, a trained deep neural network can be compressed effectively, so that deep neural networks can be applied to tasks such as multimedia content processing even on devices constrained in storage, processor performance, and communication bandwidth, such as mobile terminals.

FIG. 1 is a block diagram showing the configuration of a framework for compressing and reconstructing a deep neural network.
FIG. 2 is a block diagram showing a detailed configuration of a computer system implementing an apparatus for compressing/reconstructing a trained deep neural network for multimedia content processing according to an embodiment of the present invention.
FIG. 3 is a block diagram showing an example of the configuration of a deep neural network compression apparatus according to an embodiment of the present invention.
FIG. 4 is a diagram showing a convolutional neural network (CNN), which is an example of a trained deep neural network according to an embodiment of the present invention.
FIG. 5 is a flowchart showing an example of processing according to a pruning technique.
FIG. 6 is a flowchart showing an example of a matrix decomposition process.
FIG. 7 is a flowchart showing an example of a uniform quantization process.
FIG. 8 is a flowchart showing an example of an adaptive quantization process.
FIG. 9 is a block diagram showing an example of the configuration of a deep neural network reconstruction apparatus according to an embodiment of the present invention.

Hereinafter, preferred embodiments and examples of the present invention are described with reference to the drawings. The following embodiments and examples merely illustrate preferred configurations of the present invention, and the scope of the present invention is not limited to these configurations. In the following description, the hardware and software configuration of the apparatus, processing flow, manufacturing conditions, size, material, shape, and the like are not intended to limit the scope of the present invention unless specifically stated otherwise.

Neural network compression and reconstruction system

FIG. 1 is a block diagram showing the configuration of a framework for compressing and reconstructing a deep neural network. Referring to FIG. 1, the framework includes a parameter reduction module (Part 1), a parameter approximation module (Part 2), a reconstruction module (Part 3), an encoding module (Part 4), and a decoding module (Part 5). In FIG. 1, the original neural network model (Original NN Model) and the reconstructed neural network model (Reconstructed NN Model) are assumed to use the same model format.

The process in the parameter reduction module (Part 1) is performed as preprocessing and is optional. The original neural network model may therefore be input directly to the parameter approximation module (Part 2) without passing through the parameter reduction module (Part 1). When the parameter reduction module (Part 1) is used, however, information related to its execution (metadata and the like) is generated, and this information may be compressed and transmitted for use in reconstructing the neural network. For example, information (flags) indicating whether the pruning technique and/or the matrix decomposition technique was performed, together with information related to performing that technique, may be output from the parameter reduction module (Part 1) as metadata, as described later. The output information may be transmitted to the reconstruction module (Part 3) and/or the decoding module (Part 5) for reconstruction. The specific process performed in the parameter reduction module (Part 1) is described later.
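To make the role of this metadata concrete, the sketch below models the side information as a small record. The patent text does not define a syntax for the flags, so the class name `ReductionMetadata` and every field name here are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple

@dataclass
class ReductionMetadata:
    """Hypothetical side information emitted by the parameter reduction module."""
    pruning_used: bool = False           # flag: was pruning applied?
    decomposition_used: bool = False     # flag: was matrix decomposition applied?
    rank: Optional[int] = None           # meaningful only if decomposition_used
    original_shape: Optional[Tuple[int, ...]] = None

# Part 1 skipped entirely: every flag keeps its default value.
skipped = ReductionMetadata()

# Part 1 applied pruning and then a rank-8 decomposition of a 256 x 128 matrix.
applied = ReductionMetadata(True, True, rank=8, original_shape=(256, 128))
side_info = asdict(applied)   # plain dict, ready to be serialized and transmitted
```

Keeping the flags separate from the technique-specific fields lets a decoder decide which reconstruction steps to run before it reads any of the remaining values.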

The parameter approximation module (Part 2) applies a parameter approximation technique to the extracted parameter tensors. Parameter approximation techniques include, for example, quantization, transformation, and prediction. The quantization process is described later as an example of a process performed in the parameter approximation module (Part 2).

The reconstruction module (Part 3) reconstructs the approximated parameter tensors generated by the parameter approximation module (Part 2). In general, the approximated parameter tensors can be reconstructed using metadata recovered from the bitstream. The input to the reconstruction module (Part 3) has the same format as the output of the parameter approximation module (Part 2), and the output of the reconstruction module (Part 3) has the same format as the input to the parameter approximation module (Part 2).

The encoding module (Part 4) uses an entropy encoding technique, which is described later. The decoding module (Part 5) performs lossless decoding of the input bitstream. The input to the decoding module (Part 5) is the bitstream specified by the encoding module (Part 4) and, in the case of an empty dictionary with no items, has the same format as the input to the encoding module (Part 4). The output of the decoding module (Part 5) takes the form of a dictionary in the same format as the input to the parameter approximation module (Part 2). However, the output of the decoding module (Part 5) must be identical to the parameters, metadata, and int_parameter present in the output of the parameter approximation module (lossless coding).
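The lossless requirement on the encoding and decoding modules can be illustrated with a round trip. The sketch below is not the entropy coder of the patent: it uses JSON serialization followed by DEFLATE from Python's standard `zlib` module as a stand-in, and the dictionary keys other than `int_parameter` are assumed for illustration.

```python
import json
import zlib

def encode(approximated: dict) -> bytes:
    """Stand-in for Part 4: serialize the dictionary and compress it losslessly."""
    payload = json.dumps(approximated, sort_keys=True).encode("utf-8")
    return zlib.compress(payload, 9)

def decode(bitstream: bytes) -> dict:
    """Stand-in for Part 5: invert encode() exactly."""
    return json.loads(zlib.decompress(bitstream).decode("utf-8"))

# Hypothetical output of the parameter approximation module (Part 2).
approximated = {
    "int_parameter": [0, 0, 3, -1, 0, 0, 0, 2] * 32,  # quantized weight indices
    "metadata": {"step": 0.05, "shape": [16, 16]},
    "parameter": [0.125, -0.5],                       # values kept as floats
}

bitstream = encode(approximated)
restored = decode(bitstream)
raw_size = len(json.dumps(approximated, sort_keys=True).encode("utf-8"))
```

The check that matters is `restored == approximated`: any difference would mean the coder is lossy and would violate the requirement stated above.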

FIG. 2 is a block diagram showing a detailed configuration of a computer system 100 implementing an apparatus for compressing or reconstructing a trained deep neural network according to an embodiment of the present invention.

Referring to FIG. 2, the computer system 100 includes one or more processors 110, an input/output device interface 120, a network interface 130, an interconnector (bus) 140, a memory 150, and a storage 160. The computer system 100 may be a single computing device or a set of multiple devices comprising one or more processors and one or more associated memories.

The processor 110 fetches and executes programming instructions stored in the memory 150 or the storage 160. Likewise, the processor 110 stores application data in, or retrieves it from, the memory 150. The input/output device interface 120 connects input/output devices 12, such as a keyboard, a display, and a mouse, to the computer system 100. The network interface 130 communicates, over wired or wireless links, with an internal network (intranet) or with external networks such as the Internet or a wireless communication network, and transmits data through the data communication network 14.

The interconnector 140 transfers programming instructions and application data between the processor 110 and each of the input/output device interface 120, the storage 160, the network interface 130, and the memory 150. The interconnector 140 may be one or more buses. The processor 110 may be a single central processing unit (CPU), a plurality of CPUs, or, in various implementations, a single CPU having a plurality of processing cores. According to one aspect, the processor 110 may be a digital signal processor (DSP).

The memory 150 generally includes random access memory such as static random access memory (SRAM), dynamic random access memory (DRAM), or flash. The storage 160 generally includes non-volatile memory such as a hard disk drive, a solid-state drive (SSD), a removable memory card, optical storage, a flash memory device, or connections to network attached storage (NAS) or a storage area network (SAN).

The computer system 100 may include one or more operating systems (OS) 164. The operating system 164 may be stored partly in the memory 150 and partly in the storage 160. Alternatively, the operating system 164 may be stored entirely in the memory 150 or entirely in the storage 160. The operating system 164 provides an interface between various hardware resources such as the processor 110, the input/output device interface 120, and the network interface 130. The operating system 164 also provides common services for application programs, such as time functions.

The deep neural network compression apparatus 152 compresses a trained deep neural network for the description, analysis, and processing of multimedia content, for example a deep convolutional neural network, and outputs it as a bitstream. The description, analysis, and processing of multimedia content by a deep neural network covers various applications that use multimedia content such as video, still images, and audio: compression and decompression of video using deep neural network technology; face, iris, fingerprint, and voice recognition for security and similar purposes; motion detection for surveillance and similar purposes; diagnosis or analysis of diseases; and autonomous driving.

For example, an apparatus for compressing or decompressing video, or a coding tool that forms part of such an apparatus, is a representative field to which multimedia content processing by a deep neural network can be applied. In this case, the deep neural network compression apparatus 152 is a means for compressing the deep neural network of a trained coding tool and describing it in a compatible format. That is, the compressed deep neural network output by the deep neural network compression apparatus 152 corresponds to an interoperable compressed representation of the coding tool's neural network.

To compress a trained deep neural network, the deep neural network compression apparatus 152 performs a series of processes on it, including parameter reduction, quantization, and entropy coding, and outputs an encoded bitstream. According to this embodiment, the parameter reduction process uses a pruning technique and/or a matrix decomposition technique. For example, before parameter reduction by matrix decomposition, a pruning process that removes some of the weights constituting the deep neural network or simplifies the connections between weights may additionally be performed.
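The sequence of processes described above can be sketched end to end on a toy matrix. Everything specific in this sketch is an assumption rather than the method of the patent: magnitude-threshold pruning, a uniform quantizer with a fixed step, a row-by-row scan as the predetermined direction, and DEFLATE from the standard `zlib` module standing in for the entropy coder.

```python
import zlib

def prune(matrix, threshold):
    """Magnitude pruning: zero every weight whose absolute value is below threshold."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in matrix]

def quantize(matrix, step):
    """Uniform quantization: map each weight to the nearest integer multiple of step."""
    return [[round(w / step) for w in row] for row in matrix]

def raster_scan(matrix):
    """Scan row by row, left to right (one possible predetermined direction)."""
    return [q for row in matrix for q in row]

def entropy_code(symbols):
    """Stand-in entropy coder: one offset byte per symbol, then DEFLATE."""
    raw = bytes(q + 128 for q in symbols)  # assumes -128 <= q <= 127
    return zlib.compress(raw, 9)

# Hypothetical 4 x 4 weight matrix.
weights = [
    [0.42, -0.01, 0.03, -0.38],
    [0.02, 0.00, -0.04, 0.21],
    [-0.19, 0.01, 0.02, 0.40],
    [0.03, -0.02, 0.39, 0.01],
]

pruned = prune(weights, threshold=0.05)
levels = quantize(pruned, step=0.1)
scanned = raster_scan(levels)
bitstream = entropy_code(scanned)
```

Pruning and coarse quantization leave long runs of zeros in the scanned sequence, which is what makes the later entropy coding stage effective on realistically sized layers. On a 16-symbol toy input the container overhead of DEFLATE dominates, so the byte count here is not meaningful.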

FIG. 3 is a block diagram showing an example of the configuration of the deep neural network compression apparatus 152 according to an embodiment of the present invention. Referring to FIG. 3, the deep neural network compression apparatus 152 includes a parameter reduction unit 22, a quantization unit 24, and an entropy coding unit 26. As described above, the deep neural network compression apparatus 152 compresses a trained deep neural network for analyzing multimedia content (that is, the weights and parameters describing the trained deep neural network) using a predetermined algorithm and finally outputs an encoded bitstream.

The input to the deep neural network compression apparatus 152 includes various information and parameters describing the trained deep neural network.

First, the input to the deep neural network compression apparatus 152 includes various kinds of high-level information. For example, when the deep neural network is applied as a coding tool for video coding, the high-level information may include information indicating the type of the DNN-based coding tool. There are two types of DNN-based coding tool: a first-type coding tool, which implements a function required in both the encoder and the decoder, and a second-type coding tool, which implements a function required in only one of the encoder and the decoder. This is because, in image/video coding, some coding tools (the first type) are required in both the encoder and the decoder, while others (the second type) are required in only one of them. For example, in video coding, in-loop filtering is performed in both the encoder and the decoder and is therefore a first-type function, whereas intra mode prediction is performed only in the encoder and is therefore a second-type function; only the determined prediction mode information is sent to the decoder. Both types of DNN-based coding tool must therefore be considered, and the type must be indicated as high-level information of the trained DNN-based coding tool.

The high-level information also includes information on the overall configuration of the DNN-based multimedia content processing apparatus (hereinafter, 'DNN-based processing apparatus'). More specifically, the high-level information related to the configuration of the DNN-based processing apparatus includes: information on the target application, viewed in terms of the basic function of the neural network, such as recognition, classification, generation, or discrimination; information indicating the type of the DNN-based processing apparatus; information indicating whether the encoder, when performing a specific encoding process, chose to apply the trained tool neural network to the inference engine or to apply a standardized image or video coding tool; information on the customized content type; basic information on the algorithm of the trained DNN-based neural network, such as autoencoder, convolutional neural network (CNN), generative adversarial network (GAN), or recurrent neural network (RNN); basic information on the training data and/or test data; information on the capability required of the inference engine in terms of memory capacity and computing power; and information on model compression.

The input to the deep neural network compression device 152 also includes various parameters that describe the trained deep neural network. These parameters include information on the neural network model, for example the weights of kernels, neurons, and connections. This is described in more detail below with reference to the architecture of a convolutional neural network (CNN), which is one example of a deep neural network.

FIG. 4 shows an example of a trained deep neural network according to an embodiment of the present invention, here a convolutional neural network (CNN). Referring to FIG. 4, the CNN 200 includes an input layer 210, a convolutional layer 215, a subsampling layer 220, a convolutional layer 225, a subsampling layer 230, fully connected (FC) layers 235 and 240, and an output layer 245. In the example of FIG. 4, the input layer 210 is configured to accept a 32×32-pixel image, and the convolutional layer 215 generates six 28×28 feature maps from the input layer.
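The 32×32 to 28×28 reduction described for FIG. 4 can be checked with the usual output-size formula for a convolution. The sketch below is only illustrative: the 5×5 kernel, stride 1, and zero padding are assumptions inferred from the stated sizes, not values given in the text, and `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(input_size: int, kernel_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Spatial size of a convolution output along one axis."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Assumed 5x5 kernel, stride 1, no padding (not stated in the source text).
side = conv_output_size(32, kernel_size=5)
feature_maps = (6, side, side)  # six feature maps, as in the FIG. 4 example
print(feature_maps)
```

Under these assumptions the six feature maps come out at 28×28, matching the figure description.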

Although FIG. 4 shows one particular CNN configuration, more generally a CNN comprises one or more convolutional layers, each with a subsampling step, and one or more fully connected layers. The illustrated CNN architecture is designed to exploit the two-dimensional structure of the input image. A CNN achieves this, for example, through local connections and tied weights followed by some form of pooling. Compared with a fully connected network having a similar number of hidden units, a CNN is generally easier to train and tends to have fewer parameters.

In general, a CNN includes convolutional and subsampling layers followed by fully connected layers. According to one embodiment, the CNN receives an x×y×z image as input at the convolutional layer, where x and y are the height and width of the image and z is the number of channels in the image. For example, an RGB image has z=3 channels. The convolutional layer may include filters (kernels) of size a×b×c, where a×b is smaller than x×y and c is less than or equal to z. In general, the filter size gives rise to a locally connected structure, and the k filters are convolved with the image to produce k feature maps. Each feature map is then subsampled over adjacent regions of various sizes.

When the deep neural network is a CNN, the neural network model information includes parameter type information, parameter dimension information, and parameter index information. The parameter type information holds, together with the name of a parameter tensor defined by the model structure, a value corresponding to one of five strings: 'convolutional layer weight', 'fully connected layer weight', 'convolutional layer bias', 'fully connected layer bias', and 'other'. The parameter dimension information holds, together with the name of a parameter tensor defined by the model structure, a value corresponding to the dimension of that tensor. The parameter index information holds, together with the name of a parameter tensor defined by the model structure, a value corresponding to the index of that tensor.

Referring again to FIG. 3, the parameter reduction unit 22 reduces the set of parameters that describe the deep neural network. The parameter reduction unit 22 may apply a sparsification technique such as pruning and/or a matrix decomposition technique. Sparsification refers to making the weights sparse by setting weights to '0', or to simplifying the network by cutting its connections. Matrix decomposition refers to decomposing a parameter tensor of the neural network, that is, a weight matrix, into smaller matrices.

The parameter reduction unit 22 may apply the sparsification and matrix decomposition techniques in succession (for example, first sparsifying the weights and then decomposing the matrix of sparsified weights), or may apply only one of the two. Information indicating the method applied by the parameter reduction unit 22 may be output and included in the compressed bitstream.

By reducing the set of parameters describing the deep neural network, the parameter reduction unit 22 lowers the number of multiplications and the memory load, both for the deep neural network compression device 100 and for a device, such as a mobile device, that reconstructs and uses the compressed network. However, reducing the set of parameters is a lossy process, so the performance of the reconstructed network inevitably degrades to some degree. The parameter reduction unit 22 should therefore reduce the size of the weight matrix while minimizing the performance degradation of the reconstructed deep neural network.

In the compression of a trained deep neural network, parameter reduction is typically performed as preprocessing for quantization and is optional. The input data (the parameters describing the original deep neural network model) may therefore be fed directly to the quantization unit 24, bypassing the parameter reduction unit 22. When the input data does pass through the parameter reduction unit 22, information related to the parameter reduction process (metadata and the like) is generated and may be encoded together with the other data and included in the bitstream for reconstruction of the network. For example, information (flags) indicating whether pruning and/or matrix decomposition was performed, together with information related to the technique performed, may be output by the parameter reduction unit 22 as metadata, as described later.

The input data to the parameter reduction unit 22 consists of the parameters and the architecture (structure definition) that describe the neural network, as specified by a particular deep learning framework. The output data of the parameter reduction unit 22 takes the same form as the input data. There is no particular restriction on the representation format of the input and output data; Keras/TensorFlow or PyTorch, for example, may be used.

The 'sparsification or pruning process', one of the processes that the parameter reduction unit 22 may perform, reduces the parameter size of trained deep neural networks. Pruning evaluates the importance of the weights in each unit, for example a single weight or a channel, and sets the less important weights to '0' or removes the corresponding connections (sets the parameter value to 0). Connections may be removed one node or one weight at a time, or in larger units such as whole filters.
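The two granularities mentioned above (individual weights and whole filters) can be sketched as follows. This is a minimal illustration rather than the method of the embodiment: using the absolute value and the per-filter L1 norm as the importance measure is an assumption, the function names are hypothetical, and the tensor layout (filter, filter, input channels, output channels) follows the 4D layout described later in the text.

```python
import numpy as np


def prune_by_magnitude(weights, threshold):
    """Zero every individual weight whose absolute value is below `threshold`."""
    mask = np.abs(weights) >= threshold          # True where the weight is kept
    return weights * mask, mask


def prune_filters(weights, keep_ratio):
    """Zero whole output filters (last axis) with the smallest L1 norms."""
    norms = np.abs(weights).sum(axis=tuple(range(weights.ndim - 1)))
    n_keep = max(1, int(round(keep_ratio * norms.size)))
    keep = np.argsort(norms)[::-1][:n_keep]      # indices of the largest norms
    mask = np.zeros(norms.size, dtype=bool)
    mask[keep] = True
    return weights * mask, mask                  # mask broadcasts over the last axis


rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3, 4, 8)).astype(np.float32)
pruned, mask = prune_by_magnitude(w, threshold=0.5)
filter_pruned, filter_mask = prune_filters(w, keep_ratio=0.5)
```

Both functions keep the tensor shape unchanged, so the result is a sparse tensor of the original dimensions rather than a smaller one.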

In this pruning process, the difficulty lies in setting good thresholds so that each layer of the neural network is pruned as far as possible while the original performance is maintained. For neural networks with many layers, a brute-force search for thresholds may be impractical, particularly because the threshold of one layer may depend on the thresholds of other layers. Pruning may also require retraining the network to recover the original performance, and considerable time may be needed to confirm that the pruning is effective. As described herein, in various embodiments, automatic selection of thresholds, together with methods for retraining the network, may be used to prune a neural network and reduce its parameters.

To cut out neurons that have a weak response or unnecessary parameters within the neural network, the criterion for judging the importance of a parameter may vary with the algorithm. This criterion may be defined as an objective function, as shown in FIG. 5. Referring to FIG. 5, pruning may proceed as follows: data type decision -> objective function decision and parameter input -> pruning decision value -> pruned NN model.

In the data type decision step, the unit to be pruned (individual weights or channels) is determined. In the objective function decision and parameter input step, the objective function is defined in terms of the importance of the parameters. Parameters used to define the objective function include a weight value threshold, a scaling factor, and a gradient. External parameters include the pruning rate and, optionally, hyperparameters for retraining.

Pruning techniques fall broadly into two classes, depending on whether training is involved. The first class needs no additional training and prunes on the basis of the weights already learned. For example, a threshold on the weights is set in advance, the importance of each weight is judged against it, and pruning is then performed. The second class requires additional training, in which case the hyperparameters and training data needed for training must be supplied as input. A compression rate may additionally be set for pruning, and the objective function indicates the pruning method. Examples of the second class include learning only a scaling factor while keeping the learned weights fixed, and learning only a threshold that comes as close as possible to a specified performance (task-based loss).

In connection with pruning, the parameters and thresholds described above may be output as metadata for reconstructing the neural network. For example, mask information (indicating the positions where the weight is 0) and/or threshold information additionally required by the pruning technique may be output as metadata of the pruning or sparsification process. Information on whether the number of parameters actually decreased in the process (for example, a flag indicating whether the dimensions of the reconstructed neural network model equal those of the model before compression) may also be output as metadata of the pruning or sparsification process.
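A minimal sketch of the first class of pruning (no retraining) together with the metadata just described might look as follows. Deriving the threshold from a pruning rate via a quantile of the absolute weights is an assumption made for illustration, and the dictionary keys `zero_mask`, `threshold`, and `same_dimension` are hypothetical names, not a format defined by the text.

```python
import numpy as np


def prune_with_rate(weights, pruning_rate):
    """Prune the smallest-magnitude weights and collect reconstruction metadata."""
    # Assumed rule: the threshold is the `pruning_rate` quantile of |w|.
    threshold = float(np.quantile(np.abs(weights), pruning_rate))
    zero_mask = np.abs(weights) < threshold        # True where the weight is set to 0
    pruned = np.where(zero_mask, 0.0, weights).astype(weights.dtype)
    metadata = {
        "zero_mask": zero_mask,                    # positions of zeroed weights
        "threshold": threshold,                    # threshold that was applied
        "same_dimension": pruned.shape == weights.shape,  # dimension flag
    }
    return pruned, metadata


w = np.linspace(-1.0, 1.0, 20, dtype=np.float32).reshape(4, 5)
pruned, meta = prune_with_rate(w, pruning_rate=0.5)
```

Because element-wise pruning only zeroes entries, the dimension flag stays true here; it would become false only for a method that physically removes rows, channels, or filters.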

In the 'matrix decomposition process', the other process that the parameter reduction unit 22 may perform, one original weight matrix is decomposed into N lower-dimensional weight matrices (N being an integer of 2 or more). As a result, N indices are generated for each layer index, and a model different from the original is produced. This reduces the number of weight parameters and can also be expected to reduce runtime. Matrix decomposition is generally applicable to fully connected layers and convolutional layers.

Examples of matrix decomposition are low-rank (LR) approximation and low displacement rank (LDR) approximation. With either, the weights of a neural network expressed as a multidimensional matrix are represented by several lower-dimensional matrices, for example row matrices and/or column matrices, which reduces the amount of computation.
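To make the parameter saving concrete, the sketch below factors a fully connected weight matrix into two smaller matrices. Truncated SVD is used here only as one standard way to obtain a rank-limited factorization; the text does not name a solver, and the layer size (256×128) and rank (16) are arbitrary illustrative values.

```python
import numpy as np


def low_rank_factors(weight, rank):
    """Return A (m x rank) and B (rank x n) with A @ B approximating `weight`."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # absorb the singular values into the left factor
    b = vt[:rank, :]
    return a, b


rng = np.random.default_rng(0)
w = rng.normal(size=(256, 128))              # hypothetical FC weight matrix
a, b = low_rank_factors(w, rank=16)

original_params = w.size                     # 256 * 128
factored_params = a.size + b.size            # 256 * 16 + 16 * 128
rel_error = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
```

The parameter count drops from 32768 to 6144 in this example; the approximation error depends on how quickly the singular values of the actual trained matrix decay, which is why a random matrix as used here is a pessimistic case.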

For example, low-rank approximation may be applied to a convolutional layer as follows. The trained convolution weights of a 2D-CNN form a 4D tensor (filter size, filter size, number of input channels, number of output channels). Equation 1 shows the approximation for each channel. The dimension of W_st is d × d, where d is the filter size of the convolutional layer. The matrices U_s and V_t used in the low-rank approximation have dimension d × R, where R is the target size (rank) for the low-rank approximation.

In one example of low-rank approximation, the 2D kernel matrix of weights in each convolutional layer may be converted into two 1D kernel matrices to reduce the amount of weight data.

The aim of this method is therefore to find the (U, V) that minimizes the difference between W_st and (U_s × V_t). The cost function for optimizing the filter reconstruction is given by Equation 2.
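The idea of replacing one 2D kernel with two 1D kernels can be illustrated for the rank-1 case (R = 1). The sketch assumes that the "difference" is measured by the Frobenius norm and uses the SVD to find the best rank-1 pair; neither choice is stated in the text, and the Sobel kernel is used only because it is exactly separable.

```python
import numpy as np


def split_kernel(kernel_2d):
    """Best rank-1 split of a d x d kernel into a column and a row 1D kernel."""
    u, s, vt = np.linalg.svd(kernel_2d)
    scale = np.sqrt(s[0])                 # share the singular value between factors
    return u[:, 0] * scale, vt[0, :] * scale


sobel = np.array([[1.0, 0.0, -1.0],
                  [2.0, 0.0, -2.0],
                  [1.0, 0.0, -1.0]])
col, row = split_kernel(sobel)

# Reconstruction error under the assumed Frobenius-norm cost.
error = np.linalg.norm(sobel - np.outer(col, row))
```

Here nine weights are replaced by six with essentially zero error; for a kernel that is not separable the same code returns the closest rank-1 pair and a nonzero error.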

In another example of low-rank approximation, the 2D kernel matrix of weights in each convolutional layer may be converted into three kernel matrices to reduce the amount of weight data (for example, by CP (CANDECOMP/PARAFAC) decomposition). That is, one set of convolution weights is decomposed into three sets of weights. The basic formula of the CP decomposition is given by Equation 3.

Here, R denotes the number of components, which is the tensor rank.

In general, the weights of a trained convolutional layer in a 2D-CNN form a 4D tensor. A convolution kernel tensor K of size T × S × D × D has T corresponding to the output channels, S to the input channels, and D to the kernel size.

CP decomposition approximates the convolution kernel tensor K with rank R, as expressed by Equation 4.

Here, the three components of the decomposition and their respective sizes are denoted by symbols that appear only in the original equation images and are not preserved in this text.

The decomposed result may then be mapped as in Equation 5.

An ordinary convolution requires as many convolution operations as the number of input channels × the number of output channels, whereas a depthwise convolution requires only as many as the number of output channels.
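The counting rule in the preceding sentence can be written out directly. This is a small illustration of that rule as stated, with hypothetical channel counts (64 in, 128 out); the function name is likewise hypothetical.

```python
def kernel_applications(in_channels: int, out_channels: int,
                        depthwise: bool = False) -> int:
    """Number of 2D kernel applications per layer, following the rule in the text."""
    if depthwise:
        return out_channels               # one kernel per output channel
    return in_channels * out_channels     # one kernel per (input, output) pair


standard_ops = kernel_applications(64, 128)
depthwise_ops = kernel_applications(64, 128, depthwise=True)
ratio = standard_ops // depthwise_ops
```

Under this counting, the saving factor equals the number of input channels.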

The cost function for optimizing the filter reconstruction in this case is given by Equation 6.

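Here, the definition accompanying Equation 6 appears only in the original equation image and is not preserved in this text.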

The low-rank approximation method of this embodiment reduces the number of weight parameters and also increases computation speed.

In general, the outputs of the matrix decomposition process include the matrices A and B (the matrix factors, which are circulant operators used in the decomposition), the rank value (an integer), the reshaping mode for filter reconstruction (an integer), and the dimension and shape of the original weight matrix.

Of these, the reshaping mode (W') may take one or more modes. For example, it may have four modes as in Equation 7, although this is merely illustrative. As noted above, the mode applied in the matrix decomposition process must be transmitted as reshaping mode information for reconstruction of the neural network.

Equation 8 describes the weight matrix of a fully connected layer for the low-rank approximation technique. Because matrix decomposition can also be applied without reshaping, as in Equation 4 (see Equations 3 to 5), information on whether reshaping was applied may also be included in the metadata.

Equation 9 describes the weight matrix of a convolutional layer for the low-rank approximation technique.

In the low displacement rank approximation technique, the decomposition itself is the same as in low-rank approximation, but the input to the decomposition differs. Equation 10 describes the low displacement rank approximation technique.

Here, the value of f is set arbitrarily. It follows that, to reconstruct the neural network model, the matrices A and B must also be output by the matrix decomposition unit.

The output of the matrix decomposition process may differ depending on whether the parameter matrix belongs to a fully connected (FC) layer or a convolutional layer, and also depending on whether low-rank approximation or low displacement rank approximation is applied.

An example is shown in FIG. 6. In each of the following cases, the output of the matrix decomposition process may include information on whether the number of parameters actually decreased in the process (for example, a flag indicating whether the dimensions of the reconstructed neural network model equal those of the model before compression), the rank value, and the dimension of the original matrix. When low-rank approximation is applied to a fully connected layer (LR & FC layers), the output may additionally include Reshape_mode=none. When low-rank approximation is applied to a convolutional layer (LR & Conv layers), the output may additionally include Reshape_mode=1 (or 2, 3, 4). When low displacement rank approximation is applied to a convolutional layer (LDR & Conv layers), the output may additionally include Reshape_mode=1 (or 2, 3, 4) and the factors A and B.

According to one embodiment of the matrix decomposition process, if the k-th layer is a convolutional layer, the original weight matrix is a four-dimensional weight matrix. The dimensions of this weight matrix may be decomposed using the number of input channels, the number of output channels, and the size of the 2D filter kernel. The four-dimensional weight matrix may thus be decomposed into a plurality of two-dimensional weight matrices, and the size and elements of the resulting two-dimensional matrices may be determined by the reshaping mode.
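How a reshaping mode could determine the size of the resulting 2D matrix is sketched below. The three mode definitions are purely hypothetical, since the modes of Equation 7 are not reproduced in this text; only the general mechanism (group the four axes into rows and columns, and keep the original shape as metadata for reconstruction) is illustrated.

```python
import numpy as np

# Hypothetical modes: each maps a (d, d, S, T) shape to a 2D (rows, cols) shape.
HYPOTHETICAL_MODES = {
    1: lambda d1, d2, s, t: (d1 * d2 * s, t),
    2: lambda d1, d2, s, t: (d1 * d2, s * t),
    3: lambda d1, d2, s, t: (d1, d2 * s * t),
}


def to_2d(weight, mode):
    """Flatten a 4D convolution weight into a 2D matrix according to `mode`."""
    return weight.reshape(HYPOTHETICAL_MODES[mode](*weight.shape))


def from_2d(matrix, original_shape):
    """Restore the 4D weight using the original shape kept as metadata."""
    return matrix.reshape(original_shape)


w4d = np.arange(3 * 3 * 4 * 8, dtype=np.float32).reshape(3, 3, 4, 8)
w2d = to_2d(w4d, mode=2)
restored = from_2d(w2d, w4d.shape)
```

The round trip is lossless, which is why only the mode index and the original dimensions need to be transmitted for this step.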

Referring again to FIG. 3, the quantization unit 24 quantizes the outputs of the parameter reduction unit 22, for example the plurality of lower-dimensional weight matrices. Quantization is performed on the basis of the input number of quantization bits: the maximum and minimum values (max/min values) are extracted and the quantized weights are output. FIG. 7 shows an example of uniform quantization by the quantization unit 24.
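A uniform quantizer driven by a bit count and the extracted min/max values might look like the sketch below. It is a generic min/max uniform quantizer, not necessarily the exact procedure of FIG. 7; the metadata keys `min` and `step_size` are hypothetical names.

```python
import numpy as np


def uniform_quantize(weights, num_bits):
    """Map float weights to integers in [0, 2**num_bits - 1] using min/max."""
    w = weights.astype(np.float64)
    w_min, w_max = float(w.min()), float(w.max())
    levels = 2 ** num_bits - 1
    step_size = (w_max - w_min) / levels if w_max > w_min else 1.0
    q = np.round((w - w_min) / step_size).astype(np.int32)
    return q, {"min": w_min, "step_size": step_size}


def dequantize(q, meta):
    """Recover approximate float32 weights from the integers and metadata."""
    return (q * meta["step_size"] + meta["min"]).astype(np.float32)


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32)).astype(np.float32)
q, meta = uniform_quantize(w, num_bits=8)
restored = dequantize(q, meta)
max_error = float(np.max(np.abs(restored - w)))
```

The reconstruction error of each weight is bounded by half a step, so adding one bit roughly halves the worst-case error.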

According to one aspect of this embodiment, the quantization unit 24 may perform adaptive quantization, which reduces the quantization error by taking into account the distribution of the parameters input for quantization. In the adaptive quantization process, a compressed integer weight file and a codebook are input, and the quantization levels (r_k) and quantization region boundaries (d_k) may be expressed by Equation 11.

An example of this adaptive quantization process is shown in FIG. 8.

In an adaptive quantization process such as that of FIG. 8, non-uniform quantization may be replaced by uniform quantization if the quantization error is not sufficiently low. In this case, the input includes a configuration file indicating the number of layers. If the quantization error of non-uniform quantization is larger than that of uniform quantization, uniform quantization may be used instead.
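The fallback rule described above (use uniform quantization when the non-uniform error is larger) can be sketched as a simple comparison. The non-uniform codebook below is built from quantiles of the weight distribution, which is an assumption standing in for the levels of Equation 11, and mean squared error is assumed as the error measure; all function names are hypothetical.

```python
import numpy as np


def nearest_level(weights, codebook):
    """Index of, and value at, the nearest codebook level for each weight."""
    idx = np.abs(weights[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.int32), codebook[idx]


def choose_quantizer(weights, n_levels):
    """Pick the codebook (uniform or non-uniform) with the smaller MSE."""
    codebooks = {
        "uniform": np.linspace(weights.min(), weights.max(), n_levels),
        # Assumed non-uniform levels: denser where the weights are denser.
        "non_uniform": np.quantile(weights, (np.arange(n_levels) + 0.5) / n_levels),
    }
    errors = {}
    for name, codebook in codebooks.items():
        _, restored = nearest_level(weights, codebook)
        errors[name] = float(np.mean((weights - restored) ** 2))
    # Fall back to uniform only when the non-uniform error is larger.
    chosen = "uniform" if errors["non_uniform"] > errors["uniform"] else "non_uniform"
    indices, _ = nearest_level(weights, codebooks[chosen])
    return chosen, indices, codebooks[chosen], errors


rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=1000)
chosen, indices, codebook, errors = choose_quantizer(w, n_levels=16)
```

Whichever quantizer is chosen, the decoder needs to know the choice and the codebook, so both would travel as metadata.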

The quantization process of FIG. 8 extends the conventional quantization process, which stores the quantized (integer) values of the neural network weights in binarized form. In the process of FIG. 8, lossy coding may use techniques other than quantization, for example partitioned coding of the weight matrix, in which a partition map with good coding efficiency (estimated, for example, by rate-distortion (RD) cost) is determined. In addition, the quantized weight values are converted into a binarized file by entropy coding, which is lossless (arithmetic coding, CABAC, palette coding, index map coding, and the like). In the decoding (or decompression) process, reconstruction proceeds from an input such as the binarized file (bitstream); information indicating whether the neural network model is to be reconstructed in full or only in part may also be included.

According to one embodiment, the input to the quantization unit 24 is a dictionary that holds all parameter tensors of the network as items, each represented as a numpy array, for example a weight matrix (generally a tensor) in numpy.float32 format, with their respective names as keys. The name of each parameter tensor must match the naming specified by the corresponding model definition provided by the coordinator of that model (for example, the naming used in the test model).

양자화 유닛(24)으로부터의 출력은, 세 개의 키, 즉 " parameters", "int_parameters" 및 "metadata"를 담고 있다. 여기서, "parameters"는, 근사화 과정 동안에 수정되지 않는(즉, 근사화 과정과 상관 없는 파라미터) 모든 파라미터들을 값(values)으로, 그리고 그들 각각의 이름을 키(keys)로 가지고 있는 사전 형태(float-point로 저장)를 포함한다(예컨대, layer_type). 이것들은 예컨대 numpy.int32 포맷의 넘파이 어레이(numpy array)로 표현될 수 있다. 그리고 "int_parameters"는, 근사화 과정을 통해 생기는 정수(interger) 기반의 파라미터들, 즉 근사화 과정 동안에 수정되는 모든 파라미터 텐서들로서, 그들 각각의 정수 표현(integer representation)은 아이템으로, 그리고 그들 각각의 명칭은 키로 표현되는 사전을 포함한다. The output from quantization unit 24 contains three keys: "parameters", "int_parameters" and "metadata". Here, "parameters" is a float-in which all parameters that are not modified during the approximation process (i.e., parameters irrelevant to the approximation process) are all parameters as values and their respective names are keys as keys. (stored as point) (e.g., layer_type). These can be represented as numpy arrays in the numpy.int32 format, for example. And "int_parameters" are integer-based parameters generated through the approximation process, that is, all parameter tensors that are modified during the approximation process, their respective integer representation as an item, and their respective names It contains a dictionary represented by keys.

"metadata" contains (stores) all metadata generated during the approximation process. This metadata includes all information required for decoding/reconstruction of the compressed deep neural network. That is, the decoding/reconstruction process must be able, for example, to convert all parameter tensors of the network in numpy.int32 format back into numpy.float32 format. Metadata may be shared by all parameters (global), such as a codebook or step_size produced when quantization is performed, or by specific parameters only (local), such as metadata that varies with the characteristics of each layer, like a per_layer parameter. In the latter case, an individual mapping between parameter names and metadata must be provided for the reconstruction process. To preserve flexibility across different approximation techniques, the specific format of the metadata may be chosen freely according to the particular technique used. To ensure consistency, however, a dictionary structure is preferred.

The metadata includes, for example, the step_size and codebook information generated in the quantization process. Additionally required parameters (e.g., the lambda value used when computing the rate-distortion (RD) cost, or the hyperparameters needed when step_size is obtained by learning) may also be expressed as metadata.
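The role of global and local metadata in reconstruction can be sketched as follows. The "global"/"local" layout of the metadata dictionary, the function name `reconstruct`, and the tensor names are assumptions made for illustration; the specification only requires that a mapping from parameter names to local metadata exist.

```python
import numpy as np


def reconstruct(int_parameters, metadata):
    """Convert int32 tensors back to float32 using step sizes from metadata.

    Assumption: metadata has a "global" dictionary and an optional "local"
    dictionary keyed by parameter name; a local step_size overrides the
    global one for that parameter.
    """
    restored = {}
    for name, indices in int_parameters.items():
        local = metadata.get("local", {}).get(name, {})
        step = local.get("step_size", metadata["global"]["step_size"])
        restored[name] = indices.astype(np.float32) * np.float32(step)
    return restored


# Hypothetical quantized tensors and metadata.
int_parameters = {
    "fc/kernel": np.array([2, -6], dtype=np.int32),
    "fc/bias": np.array([4], dtype=np.int32),
}
metadata = {
    "global": {"step_size": 0.05},
    "local": {"fc/bias": {"step_size": 0.25}},  # per-parameter override
}
restored = reconstruct(int_parameters, metadata)
```

Here "fc/kernel" is rescaled with the global step size, while "fc/bias" uses its own local entry. A codebook-based scheme would follow the same pattern, with a table lookup in place of the multiplication.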

Referring still to FIG. 3, the entropy coding unit 26 encodes each of the quantized weights and indexes according to a predetermined algorithm (e.g., entropy coding) and, as a result, outputs a bitstream of the compressed deep neural network. In the present embodiment, there is no particular limitation on the specific entropy coding process; any algorithm known in the art may be applied without limitation, unless it is inherently inapplicable given the nature of entropy coding.

According to an embodiment, the entropy coding unit 26 can distinguish data formats that it can encode from data formats that it cannot encode. All data that the entropy coding unit 26 cannot interpret (understand) is classified as "auxiliary data" and is passed directly into the bitstream in a serialized manner. The input of the entropy coding unit 26 has the same format as the output of the quantization unit 24. The output from the entropy coding unit 26 is a bitstream represented, for example, as a numpy array in numpy.float32 format. This bitstream carries information on all items in the input dictionary. Other information, such as key names or the dictionary structure, does not need to be encoded into the bitstream.
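The split between encodable data and auxiliary data might look like the sketch below. It assumes that only numpy.int32 tensors under "int_parameters" are encodable and that auxiliary data is serialized with Python's `pickle` module; both choices, as well as the function name `split_entropy_input`, are illustrative assumptions rather than part of the specification. The sketch keeps key names in the auxiliary payload for readability, although the text notes that they need not be encoded.

```python
import pickle

import numpy as np


def split_entropy_input(quantizer_output):
    """Separate entropy-codable tensors from auxiliary data.

    Assumption: an item is encodable only if it is a numpy.int32 array
    stored under "int_parameters"; everything else is auxiliary.
    """
    encodable, auxiliary = {}, {}
    for group, items in quantizer_output.items():
        for name, item in items.items():
            is_int_tensor = isinstance(item, np.ndarray) and item.dtype == np.int32
            if group == "int_parameters" and is_int_tensor:
                encodable[name] = item
            else:
                auxiliary[(group, name)] = item
    # Auxiliary data bypasses entropy coding and is written out serialized.
    return encodable, pickle.dumps(auxiliary)


# Hypothetical quantizer output in the three-key format.
quantizer_output = {
    "parameters": {"layer_type": "conv"},
    "int_parameters": {"w": np.array([1, 2], dtype=np.int32)},
    "metadata": {"step_size": 0.05},
}
encodable, auxiliary_bytes = split_entropy_input(quantizer_output)
```

Only the tensor "w" would be handed to the entropy coder in this example; the layer type and step size travel as serialized bytes.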

The entropy coding unit 26 may, for example, start scanning the weight parameters from the top left. That is, each row is scanned from left to right in row-first order, and the rows are scanned from top to bottom. The quantized index values are then binarized and stored in binarized form. For example, a quantized index value may be binarized using "Sig_flag", "Sign_flag", "AbsGrXFlag", "MaxNumNoRem" and/or "RemAbs" values. Here, "Sig_flag" indicates whether or not the index value is 0. "Sign_flag" indicates the sign of a non-zero index value. "AbsGrXFlag" indicates the magnitude of the index value: if AbsGrXFlag==1, the index value is greater than X (where X is generally expressed as a power of 2), and the flag must therefore continue to be signaled until it becomes 0. "MaxNumNoRem" indicates the maximum AbsGrX to be signaled, and "RemAbs" represents, in binarized form, the remainder values that arise when the value of the AbsGrXFlag corresponding to MaxNumNoRem is 0. The bins representing the remainder are coded in bypass mode. For example, if MaxNumNoRem==8, the value of RemAbs must be sent when the absolute value of the index is 5 to 8. The context model applies to "Sig_flag", "Sign_flag", "AbsGrXFlag" and the like used in the binarization process.
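The scan order and binarization can be illustrated with the following sketch, which is only one possible reading of the description. It assumes that the thresholds X are the powers of two 1, 2, 4, ... up to MaxNumNoRem, that Sign_flag is 1 for negative values, and that RemAbs is a fixed-length code locating the magnitude inside the interval left open by the last AbsGrXFlag. The function names and the handling of magnitudes above MaxNumNoRem (rejected here) are hypothetical.

```python
def row_major_scan(matrix):
    """Scan each row left to right, taking rows from top to bottom."""
    return [value for row in matrix for value in row]


def binarize_index(index, max_num_no_rem=8):
    """Return (syntax_element, bin) pairs for one quantized index.

    Assumptions (not fixed by the text): thresholds X = 1, 2, 4, ...,
    and a fixed-length remainder for the interval (previous X, X].
    """
    bins = [("Sig_flag", int(index != 0))]
    if index == 0:
        return bins
    bins.append(("Sign_flag", int(index < 0)))
    magnitude = abs(index)
    if magnitude > max_num_no_rem:
        raise ValueError("magnitude above max_num_no_rem is outside this sketch")
    lower, x = 0, 1  # lower: largest threshold known to be below magnitude
    while True:
        greater = int(magnitude > x)
        bins.append((f"AbsGr{x}Flag", greater))  # context-coded flag
        if not greater:
            break
        lower, x = x, x * 2
    # magnitude lies in (lower, x]; x - lower is a power of two here.
    width = (x - lower).bit_length() - 1
    remainder = magnitude - lower - 1
    for shift in reversed(range(width)):
        bins.append(("RemAbs", (remainder >> shift) & 1))  # bypass-coded bin
    return bins


scanned = row_major_scan([[0, 1], [-5, 8]])
all_bins = [binarize_index(value) for value in scanned]
```

Under these assumptions, with MaxNumNoRem==8 the magnitudes 5 to 8 all end with AbsGr8Flag equal to 0 and are distinguished by two RemAbs bins, which is consistent with the example in the text.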

FIG. 9 is a block diagram showing an example of the configuration of a deep neural network restoration apparatus 154 according to an embodiment of the present invention. Referring to FIG. 9, the deep neural network restoration apparatus 154 includes an entropy decoding unit 32 and an inverse quantization unit 34. Depending on the embodiment, the deep neural network restoration apparatus 154 may further include a parameter reinforcement unit 36. As described with reference to FIG. 1, the decoding process in the deep neural network restoration apparatus 154 corresponds to the encoding process in the deep neural network compression apparatus 152 described with reference to FIG. 3. That is, the deep neural network restoration apparatus 154 restores the deep neural network by reconstructing the data in the reverse order of the encoding process, based on the metadata included in the input bitstream.

As described above, the foregoing description is merely an embodiment, and the invention should not be construed as being limited to it. The technical idea of the present invention should be defined only by the invention set forth in the claims below, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of the present invention. It is therefore apparent to those skilled in the art that the above-described embodiments may be modified and implemented in various forms.

Claims

An apparatus for compressing a trained deep neural network, comprising:
a parameter reduction unit for simplifying a weight matrix describing a deep neural network;
a quantization unit for quantizing the weight matrix reduced by the parameter reduction unit; and
an entropy coding unit for scanning the weight matrix quantized by the quantization unit in a predetermined direction and then sequentially entropy coding the scanned weights to output a bitstream,
wherein the parameter reduction unit simplifies the weight matrix by using at least one of a pruning technique and a low rank approximation technique.

The apparatus of claim 1,
wherein the parameter reduction unit outputs information indicating the technique used to simplify the weight matrix.

The apparatus of claim 1,
wherein the low rank approximation technique expresses the original weight matrix of the trained deep neural network by decomposing it into a plurality of lower-dimensional low rank weight matrices.

The apparatus of claim 3,
wherein the parameter reduction unit outputs one or more of a circulant operator, a rank value, dimension and shape information of the initial weight matrix, and a reshaping mode used in the low rank approximation technique.

A method of compressing a trained deep neural network, comprising:
a parameter reduction step of simplifying a weight matrix describing the trained deep neural network;
a quantization step of quantizing the weight matrix reduced in the parameter reduction step; and
an entropy coding step of scanning the weight matrix quantized in the quantization step in a predetermined direction, sequentially entropy coding the scanned weights, and outputting a bitstream,
wherein at least one of a pruning technique and a low rank approximation technique is used in the parameter reduction step.