KR20220008135A

KR20220008135A - Method and apparatus for image super resolution

Info

Publication number: KR20220008135A
Application number: KR1020200086360A
Authority: KR
Inventors: 강석주; 서유림
Original assignee: 서강대학교산학협력단
Priority date: 2020-07-13
Filing date: 2020-07-13
Publication date: 2022-01-20
Also published as: KR102582706B1

Abstract

A method of operating a computing device run by at least one processor includes: training a teacher model using training data that includes a low-resolution image and a high-resolution image corresponding to the low-resolution image, and initially training a student model with the training data; assigning weights to the feature values of high importance among the feature values generated while training the teacher model, and further training the student model using these important feature values; and inputting an arbitrary image into the student model and outputting an image in which the resolution of the arbitrary image is increased. The student model is a deep learning model equal to or smaller in size than the teacher model. The invention thereby reduces computational complexity and memory consumption.

Description

Image super-resolution processing method and apparatus

The present invention relates to a technique for increasing the resolution of an image.

High-resolution displays such as Ultra-High Definition (UHD) have recently appeared on the market, and consumer demand for high resolution has grown accordingly. Interest is therefore increasing in super-resolution (SR) algorithms that can convert low-resolution (LR) images, such as existing Full-High Definition (FHD) content, into high-resolution (HR) images.

For example, polynomial-based interpolation methods such as bicubic interpolation, and local-patch-based super-resolution methods that use linear mapping, can generate high-quality, high-resolution images with low complexity and a small amount of computation. These techniques can therefore be applied to real display systems without difficulty.

However, these existing super-resolution techniques degrade image quality in fine details during image restoration. In addition, because they are based on relatively simple linear mappings, they have difficulty modeling the complex, nonlinear relationship between low-resolution and high-resolution images.

To address these problems, deep-learning-based super-resolution algorithms have recently been studied extensively, and they show higher performance than the existing algorithms.

In particular, super-resolution networks based on convolutional neural networks (CNNs) outperform the existing algorithms, and attempts are under way to implement them in display systems. A CNN uses a network of many stacked layers to model precisely the complex nonlinear relationship between a low-resolution input and a high-resolution output. Because it learns its convolution filter parameters, it performs better than existing algorithms based on simple linear mapping.

However, deep-learning-based super-resolution algorithms involve a large number of layers and parameters, so they have high complexity and can increase both the amount of computation and memory consumption. For example, Very Deep Super Resolution (VDSR), a representative deep-learning-based super-resolution network, has 20 layers and requires more than 600,000 filter parameters.
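The parameter figure quoted for VDSR can be checked with simple arithmetic. The sketch below assumes the commonly published VDSR configuration (20 layers of 3x3 convolutions, 64 channels in the hidden layers, a single-channel input and output); these layer sizes are not stated in this document, and biases are ignored.

```python
def conv_weights(in_ch: int, out_ch: int, k: int) -> int:
    """Number of weights in a k x k convolution layer (biases excluded)."""
    return in_ch * out_ch * k * k


def vdsr_weight_count(depth: int = 20, width: int = 64, k: int = 3,
                      image_ch: int = 1) -> int:
    """Weight count of a plain stack of convolutions.

    depth, width, k and image_ch are assumed values for a VDSR-like
    network; they are not taken from this document.
    """
    first = conv_weights(image_ch, width, k)               # image -> features
    middle = (depth - 2) * conv_weights(width, width, k)   # hidden layers
    last = conv_weights(width, image_ch, k)                # features -> image
    return first + middle + last


print(vdsr_weight_count())  # 664704, i.e. more than 600,000
```

Under these assumptions almost all of the weights sit in the 18 hidden 64-to-64 layers, which is why reducing filter sizes and channel counts is an effective way to shrink such a network.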

This causes problems when a deep-learning-based super-resolution algorithm is implemented in an actual display. Specifically, low-power devices with limited hardware resources, such as embedded systems and mobile devices, have restricted power budgets and memory sizes, which can be an obstacle to deploying such algorithms. A method is therefore needed that makes a deep-learning-based super-resolution algorithm lightweight while minimizing the loss of performance.

The problem to be solved is to provide a method of training a teacher network with training images, training a lightweight student network using information from the trained teacher network, and outputting a high-resolution image using the student network.

A method of operating a computing device run by at least one processor, according to an embodiment, includes: training a teacher model using training data that includes a low-resolution image and a high-resolution image corresponding to the low-resolution image, and initially training a student model with the training data; assigning weights to feature values of high importance among the feature values generated in the training process of the teacher model, and further training the student model using the important feature values; and inputting an arbitrary image into the student model and outputting an image in which the resolution of the arbitrary image is increased. The student model is a deep learning model equal to or smaller in size than the teacher model.

The teacher model may include at least one residual block that includes a plurality of convolutional layers that extract features of the training data, an activation function that passes on the results of the convolutional layers, and a multiplication (Mult) layer that scales the result of the activation function.

In the further training of the student model, the loss function of the student model may be modified using the output values of each residual block of the teacher model.

In the further training of the student model, among the output values of the residual blocks, those that are weighted because they are judged to be important information for the teacher model to generate the high-resolution image from the low-resolution image may be determined to be the important feature values.

A computing device according to an embodiment includes a memory and at least one processor that executes instructions of a program loaded into the memory. The program includes instructions written to execute: initially training a student model using training data that includes a low-resolution image and a high-resolution image corresponding to the low-resolution image; extracting, from a teacher model that has been trained with the training data, output values generated in the training process of the teacher model, and retraining the student model using the output values; and inputting an arbitrary low-resolution image into the student model and outputting a high-resolution image of the arbitrary image. The student model is a deep learning model equal to or smaller in size than the teacher model.

In the retraining, weights may be assigned to important output values, which are those among the output values determined to be important information for the teacher model to generate the high-resolution image from the low-resolution image, and the loss function of the student model may be modified using the weighted important output values.

The teacher model and the student model may each include at least one block that includes a plurality of convolutional layers that extract features of the training data and an activation function that passes on the results of the convolutional layers. In the retraining, output values may be extracted from each block included in the teacher model.

The retraining may use the loss function used in the initial training together with a loss function based on the difference between the important output values and the output values produced by the student model in the initial training process.

According to the present invention, the resolution of an image can be increased using a student network with a small number of parameters, which reduces computation and memory consumption and allows deployment on resource-constrained hardware.

In addition, according to the present invention, specific information of high importance in the training stage of the teacher network is transferred to the student network by knowledge distillation, so the student network can be trained without changing the network structure or increasing the number of parameters.

FIG. 1 is a block diagram of an image super-resolution apparatus according to an embodiment.
FIG. 2 is an explanatory diagram of a teacher network according to an embodiment.
FIG. 3 is an explanatory diagram of a student network according to an embodiment.
FIG. 4 is a flowchart of a method of operating an image super-resolution apparatus according to an embodiment.
FIGS. 5 and 6 are explanatory diagrams of a knowledge transfer method according to an embodiment.
FIG. 7 is an exemplary diagram of a method of operating an image super-resolution apparatus according to an embodiment.
FIG. 8 is a hardware configuration diagram of a computing device according to an embodiment.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" an element, this means that it may further include other elements, rather than excluding them, unless specifically stated otherwise. In addition, terms such as "…unit", "…er", and "module" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, the network lightweighting method and the knowledge distillation technique applied to the present invention are briefly described.

First, a lightweight architecture is a network designed to train an artificial neural network more efficiently with a minimal number of parameters. Examples include ResNet, DenseNet, and MobileNet. SqueezeNet in particular achieves performance similar to AlexNet with 50 times fewer parameters. Specifically, SqueezeNet uses the Fire module, which replaces 3x3 convolution filters with a mixture of 1x1 and 3x3 filters. The Fire module includes a squeeze layer and an expand layer. The squeeze layer consists only of 1x1 filters, and the expand layer includes 1x1 filters and 3x3 filters. Because SqueezeNet is a known technique, a detailed description is omitted.
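To illustrate why a Fire module is lighter than a plain 3x3 convolution, the following sketch counts weights for both. The channel sizes (64 input channels, 16 squeeze channels, 32 + 32 expand channels) are hypothetical values chosen for the example, and biases are ignored.

```python
def fire_weights(in_ch: int, squeeze_ch: int,
                 expand1_ch: int, expand3_ch: int) -> int:
    """Weights in a Fire module: 1x1 squeeze, then 1x1 and 3x3 expand filters."""
    squeeze = in_ch * squeeze_ch * 1 * 1          # squeeze layer, 1x1 only
    expand1 = squeeze_ch * expand1_ch * 1 * 1     # expand layer, 1x1 filters
    expand3 = squeeze_ch * expand3_ch * 3 * 3     # expand layer, 3x3 filters
    return squeeze + expand1 + expand3


def plain_conv3x3_weights(in_ch: int, out_ch: int) -> int:
    """Weights in an ordinary 3x3 convolution with the same input and output width."""
    return in_ch * out_ch * 3 * 3


# Hypothetical sizes: 64 channels in, 32 + 32 = 64 channels out.
fire = fire_weights(64, 16, 32, 32)
plain = plain_conv3x3_weights(64, 64)
print(fire, plain, plain / fire)  # 6144 36864 6.0
```

With these example sizes the Fire module needs one sixth of the weights; the exact saving depends on how aggressively the squeeze layer reduces the channel count.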

Knowledge distillation is a method of transferring knowledge from a number of large teacher networks trained with an ensemble technique to a single small student network. The teacher network is typically a high-performance neural network, while the student network is typically shallow and has lower performance. Prior studies use only the output values of the teacher and student networks, or use hidden feature values (features).

The present invention proposes a method of efficiently training a deep-learning-based super-resolution network using knowledge distillation, and a network that uses SqueezeNet, one of the deep learning network lightweighting techniques, to minimize the performance loss caused by lightweighting. The image super-resolution apparatus proposed by the present invention is described below.

FIG. 1 is a block diagram of an image super-resolution apparatus according to an embodiment, FIG. 2 is an explanatory diagram of a teacher network according to an embodiment, and FIG. 3 is an explanatory diagram of a student network according to an embodiment.

Referring to FIG. 1, the image super-resolution apparatus 10 receives a low-resolution image and restores it to a high-resolution image. The teacher network 100 is trained first; the trained teacher network 100 is then used to train the lightweight student network 200, and the student network 200 is used to increase the image resolution.

Using knowledge distillation, the weight parameters of the student network 200 can be trained on the basis of information from the teacher network 100, whose training has already been completed.

The student network 200 is trained to mimic the behavior of the teacher network 100. As a result, the student network 200 can achieve a higher recognition rate than when it is trained from scratch with the ordinary backpropagation algorithm.

The teacher network 100 and the student network 200 may be implemented in a single computing device or distributed across separate computing devices. When they are distributed across separate computing devices, the teacher network 100 and the student network 200 may communicate with each other through a communication interface. Any computing device capable of executing a software program written to carry out the present invention is sufficient, for example a server or a laptop computer.

Each of the teacher network 100 and the student network 200 may be a single artificial intelligence model or may be implemented as a plurality of artificial intelligence models. Likewise, the image super-resolution apparatus 10 may be a single artificial intelligence model or may be implemented as a plurality of artificial intelligence models. Accordingly, the one or more artificial intelligence models corresponding to the components described above may be implemented by one or more computing devices.

Referring to FIG. 2, the teacher network 100 includes a convolutional layer, residual blocks, and an upscaling module. The teacher network 100 may be implemented as EDSR (Enhanced Deep Super Resolution); each element of the teacher network 100 is described below.

The convolutional layer extracts features from the input low-resolution image.

Each residual block includes nine residual modules, and each residual module consists of two convolutional layers, one rectified linear unit (ReLU), and a multiplication (Mult) layer.

The rectified linear unit is an activation function, and the multiplication layer multiplies the output at the end of each module by a specific constant. The constant may be, for example, 0.1.

The upscaling module increases the resolution of the image and may be implemented as a sub-pixel convolutional neural network (SPCNN), which performs better than polynomial interpolation such as bicubic interpolation.
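The core of sub-pixel convolution is a rearrangement step that turns r*r low-resolution channels into one channel that is r times larger in each direction. The sketch below shows only that rearrangement, using nested lists and the channel ordering commonly used by deep learning frameworks; the function name and the ordering are assumptions for illustration, and the preceding convolution is omitted.

```python
from typing import List

Image = List[List[float]]          # H x W
Tensor = List[Image]               # C x H x W


def pixel_shuffle(x: Tensor, r: int) -> Tensor:
    """Rearrange (C*r*r) x H x W into C x (H*r) x (W*r)."""
    in_ch, h, w = len(x), len(x[0]), len(x[0][0])
    if in_ch % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    out_ch = in_ch // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(out_ch)]
    for c in range(out_ch):
        for i in range(h * r):
            for j in range(w * r):
                # Each output pixel is taken from one of the r*r input channels.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out


# Four 1x1 channels become one 2x2 channel.
print(pixel_shuffle([[[1.0]], [[2.0]], [[3.0]], [[4.0]]], 2))
# [[[1.0, 2.0], [3.0, 4.0]]]
```

Because the upscaling is learned by the convolution that produces the r*r channels, the network is not tied to a fixed interpolation kernel.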

The three residual blocks and the nine residual modules may be connected to one another by skip connections. A skip connection adds the input of one convolutional layer directly to the output of another convolutional layer, so that each convolutional layer only has to predict the difference (residual) between its input and output.
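A residual module of the kind described (convolution, ReLU, convolution, multiplication by a constant, then a skip connection) can be sketched as follows. The two convolutions are passed in as plain callables on flat lists so that the example stays self-contained; the function names and this simplification are assumptions, not the patent's implementation.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def relu(x: Vector) -> Vector:
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in x]


def residual_module(x: Vector, conv1: Layer, conv2: Layer,
                    scale: float = 0.1) -> Vector:
    """conv -> ReLU -> conv -> multiply by a constant, then add the input back."""
    branch = conv2(relu(conv1(x)))
    # Skip connection: the layers only model the residual that is added to x.
    return [xi + scale * bi for xi, bi in zip(x, branch)]


# Stand-in "convolutions" for demonstration: doubling and identity.
out = residual_module([1.0, -2.0], lambda v: [2.0 * a for a in v], lambda v: list(v))
print(out)
```

Scaling the branch by a small constant such as 0.1 keeps each module's contribution small relative to the signal carried by the skip connection.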

Referring to FIG. 3, the student network 200 includes a convolutional layer, squeeze blocks, and an upscaling module. The student network 200 may be implemented as SqueezeSR; each element of the student network 200 is described below.

As in the teacher network 100, the first convolutional layer extracts features from the input low-resolution image. Three squeeze blocks follow, each of which includes a squeeze layer and an expand layer. The squeeze blocks may be connected by skip connections.

The squeeze layer is a 1x1 convolution filter, and the expand layer is a 1x1 convolution filter connected with a 3x3 convolution filter.

The squeeze layer reduces the number of input channels before passing them to the next layer. That is, it reduces the number of channels delivered to the expand layer. This means that the number of parameters is reduced by a factor of 9 compared with a general CNN model.
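One way to read the factor of 9 is as the weight ratio between a 3x3 filter and a 1x1 filter with the same channel counts. This reading is an assumption; the symbols C_in and C_out are illustrative and biases are ignored.

```latex
P_{3\times 3} = 3 \cdot 3 \cdot C_{\mathrm{in}} \cdot C_{\mathrm{out}} = 9\, C_{\mathrm{in}} C_{\mathrm{out}},
\qquad
P_{1\times 1} = 1 \cdot 1 \cdot C_{\mathrm{in}} \cdot C_{\mathrm{out}} = C_{\mathrm{in}} C_{\mathrm{out}},
\qquad
\frac{P_{3\times 3}}{P_{1\times 1}} = 9
```

Reducing the number of channels that reach the 3x3 filters of the expand layer lowers the count further, since C_in in the first expression then refers to the squeezed channel count.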

Unlike the known form of SqueezeNet, the student network 200 of the present invention does not include a batch normalization layer or a downsampling layer.

The upscaling module that increases the resolution of the image may be implemented as SPCNN, as in the teacher network 100.

The following describes how the student network 200 receives information generated in the training process of the teacher network 100 and undergoes further training.

FIG. 4 is a flowchart of a method of operating an image super-resolution apparatus according to an embodiment, and FIGS. 5 and 6 are explanatory diagrams of a knowledge transfer method according to an embodiment.

Referring to FIG. 4, the image super-resolution apparatus 10 initially trains the teacher network 100 and the student network 200 using training data that includes low-resolution images and the corresponding high-resolution images (S110). The images used for initial training may be color images with three RGB color channels.

The image super-resolution apparatus 10 connects the trained teacher network 100 and student network 200 with a TAT (Teacher Attention Transfer) module (S120).

Since the feature maps of the student network 200 and the teacher network 100 are similar, transferring the feature-map information judged important in the training process of the teacher network 100 to the student network 200 allows the student network 200 to be trained more efficiently.

In general, a convolutional neural network performs a convolution operation with a filter on each channel of three-dimensional input data that has three color channels. The filter detects features of the input image. Because the filter is smaller than the image, it responds to the characteristics of a local part of the image and outputs a result called a feature map.
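The way a small filter slides over an image to produce a feature map can be sketched with nested lists. The code below uses no padding and a stride of 1, and it computes the cross-correlation form that CNN libraries conventionally call convolution; the function names and the tiny example values are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over one channel (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def conv2d_multichannel(channels: List[Matrix], kernels: List[Matrix]) -> Matrix:
    """Filter each channel with its own kernel slice and sum into one feature map."""
    maps = [conv2d_valid(ch, k) for ch, k in zip(channels, kernels)]
    return [[sum(m[i][j] for m in maps) for j in range(len(maps[0][0]))]
            for i in range(len(maps[0]))]


channel = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
diagonal_edge = [[1.0, 0.0], [0.0, -1.0]]
print(conv2d_valid(channel, diagonal_edge))  # [[-4.0, -4.0], [-4.0, -4.0]]
```

Each entry of the output depends only on a 2x2 patch of the input, which is the sense in which a feature map reflects local characteristics of the image.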

The output values of certain filters may be regarded as more important than others in the process of learning to increase image resolution. Weights can therefore be assigned to the outputs of these filters to reflect their importance.

A TAT module based on the attention mechanism may be used as the method of assigning the weights.

The attention mechanism is a structure that lets a model focus on the important parts of its input: each time the decoder predicts an output value, it refers back to the entire input of the encoder. Rather than attending to the whole input uniformly, it concentrates on the specific parts of the input that are related to the value to be predicted. In a deep learning model, attention can be implemented as an importance vector expressed as weights. Because the attention mechanism is a known technique, a detailed description is omitted.
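A minimal sketch of attention as an importance vector follows. It assumes dot-product scoring followed by a softmax, which is one common choice and is not specified in this document; the names `query`, `keys`, and `values` are illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector],
           values: List[Vector]) -> Tuple[Vector, Vector]:
    """Weight every input by its relevance to the query and blend the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)          # the importance vector
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0], [20.0]])
print(weights, context)
```

In this example the first input matches the query better, so it receives the larger weight and the blended result lies closer to its value.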

That is, when the student network 200 receives information from the teacher network 100, the TAT module directs it to the outputs of the filters of the teacher network 100 that were judged important. It passes channel-wise weights for the highly important information to the student network 200, which helps the student network 200 concentrate on the more important information during training.

The image super-resolution apparatus 10 transfers to the student network 200 information that includes the weights for the channel-wise highly important information of the teacher network 100 (S130).

Specifically, the TAT module obtains the feature map t_n output by each residual block of the teacher network 100 and the output feature map s_n from each squeeze block of the student network 200. Here n is the index of the residual block or squeeze block; in the example of FIG. 5 it is a natural number from 1 to 3.

The TAT module computes a loss function using the output values of each block. The values input to the loss function, that is, the outputs of the TAT module, can be obtained through Equation 1.

[Equation 1]

In Equation 1, k denotes a variable used to improve performance, c denotes the channel size of the output feature map of the teacher network 100, and c' denotes the channel size of the output feature map of the student network 200. T_n and S_n are the values input to the final loss function.

When k is 1, T_n may denote the average feature map of t_n, and S_n the average feature map of s_n.
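Equation 1 itself is not reproduced in this text. One form that is consistent with the definitions above, offered purely as an assumption and not as the patent's formula, averages the k-th power of the feature map over its channels, where t_{n,i} and s_{n,i} denote the i-th channel of t_n and s_n (hypothetical notation):

```latex
T_n = \frac{1}{c} \sum_{i=1}^{c} \left( t_{n,i} \right)^{k},
\qquad
S_n = \frac{1}{c'} \sum_{i=1}^{c'} \left( s_{n,i} \right)^{k}
```

With k = 1 this reduces to the channel-wise mean of each feature map, which matches the statement that T_n and S_n are then average feature maps.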

The output feature map t_n of the teacher network 100 may be a vector or tensor of size C x H x W, where C is the channel size of the output feature map, H is its height, and W is its width.

To obtain a channel-wise attention vector for the teacher network 100, global average pooling, which extracts the mean of each output feature map, may be used. Global average pooling is applied to each output feature map t_n and followed by an activation function, which yields a tensor of size C x 1 x 1.

This tensor is then multiplied, in order, by s_n, the output feature map of the student network 200. Averaging the resulting tensor across channels gives a final tensor of size 1 x H x W.
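The pooling, weighting, and channel-averaging steps can be sketched as below. The sketch assumes a sigmoid as the activation function (the text only says "an activation function") and assumes the teacher and student feature maps have the same number of channels; the function names are hypothetical.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]   # C x H x W


def global_average_pool(fmap: FeatureMap) -> List[float]:
    """Mean of every channel: C x H x W -> C values."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def channel_attention(teacher_fmap: FeatureMap) -> List[float]:
    """Channel-wise weights from the teacher (sigmoid is an assumed activation)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in global_average_pool(teacher_fmap)]


def attended_map(weights: List[float], student_fmap: FeatureMap) -> List[List[float]]:
    """Scale each student channel by its weight, then average over channels (H x W)."""
    c, h, w = len(student_fmap), len(student_fmap[0]), len(student_fmap[0][0])
    if len(weights) != c:
        raise ValueError("this sketch assumes equal channel counts")
    return [[sum(weights[k] * student_fmap[k][i][j] for k in range(c)) / c
             for j in range(w)]
            for i in range(h)]


teacher = [[[0.0]], [[0.0]]]           # two 1x1 channels, so both weights are 0.5
student = [[[2.0]], [[4.0]]]
print(attended_map(channel_attention(teacher), student))  # [[1.5]]
```

The result has one value per spatial position, so the teacher's channel importance is folded into a single H x W map that can be compared across networks.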

The image super-resolution apparatus 10 performs fine-tuning of the student network 200 by additionally using the loss function computed through the TAT module (S140). The loss function used for fine-tuning may be as given in Equation 2.

[Equation 2]

In Equation 2, y denotes the high-resolution image output by the student network 200, which is compared against the ground-truth (GT) image used for training. loss_0 is the loss function for the training result of the student network 200 in step S110, that is, the loss computed from the difference between the output image of the student network 200 and the ground-truth image included in the training data. loss_1, loss_2, and loss_3 are the loss functions computed from the differences between the output feature maps of each filter of the student network 200 and the teacher network 100. T_n denotes the output of the TAT module for the output of a residual block of the teacher network 100, and S_n denotes the output of the TAT module for the output of a squeeze block of the student network 200.

That is, the final loss function of the image super-resolution apparatus 10 is calculated as the sum of loss_0, the loss function for the classification performance of the student network 200, and loss_1 to loss_3, the loss functions representing the difference between the classification results of the teacher network 100 and the student network 200.
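The summation described above can be sketched as follows. This is an illustrative assumption rather than the disclosed formula: the text states only that the terms are summed, so the mean absolute difference used for each term, and the names `l1_distance` and `total_loss`, are hypothetical choices.

```python
import numpy as np

def l1_distance(a, b):
    """Mean absolute difference; an assumed distance, not stated in the text."""
    return float(np.mean(np.abs(a - b)))

def total_loss(y, gt, student_maps, teacher_maps, distance=l1_distance):
    """Return loss_0 plus one distillation term per (S_n, T_n) pair."""
    if len(student_maps) != len(teacher_maps):
        raise ValueError("student and teacher must provide the same number of maps")
    loss_0 = distance(y, gt)  # student output vs. ground-truth image
    distill = [distance(s, t) for s, t in zip(student_maps, teacher_maps)]
    return loss_0 + sum(distill), loss_0, distill

# Hypothetical tensors: a 3-channel 4x4 image and three 1xHxW TAT outputs.
y = np.zeros((3, 4, 4))
gt = np.ones((3, 4, 4))
S = [np.zeros((1, 4, 4)) for _ in range(3)]
T = [np.full((1, 4, 4), v) for v in (0.5, 0.25, 0.25)]

loss, loss_0, distill = total_loss(y, gt, S, T)
print(loss, loss_0, distill)
```

Passing lists of maps rather than three fixed arguments keeps the sketch usable when the number of blocks, and therefore the number of added loss terms, differs from three.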

Meanwhile, in the present specification, the teacher network 100 has three residual blocks and the student network 200 has three squeeze blocks, so three loss function terms are added; the number of added loss function terms is not necessarily limited thereto.

FIG. 7 is an exemplary diagram of a method of operating an image super-resolution apparatus according to an embodiment.

(a) of FIG. 7 is the ground-truth image, and (b) of FIG. 7 is the result of restoring that image by bicubic interpolation. Bicubic interpolation is an interpolation method that reconstructs a hole using the products of the pixel values of the 16 adjacent pixels and weights that depend on distance.
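The 16-pixel weighted sum described above can be sketched as follows. This is a minimal illustration under stated assumptions: the specification does not say which weighting kernel is used, so the widely used cubic convolution kernel with a = -0.5 is assumed here, border pixels are replicated, and the names `cubic_weight` and `bicubic_sample` are hypothetical.

```python
import numpy as np

def cubic_weight(d, a=-0.5):
    """Distance-dependent weight of the cubic convolution kernel (assumed a = -0.5)."""
    d = abs(d)
    if d <= 1.0:
        return (a + 2.0) * d**3 - (a + 3.0) * d**2 + 1.0
    if d < 2.0:
        return a * d**3 - 5.0 * a * d**2 + 8.0 * a * d - 4.0 * a
    return 0.0

def bicubic_sample(image, y, x):
    """Interpolate a 2-D image at fractional position (y, x) from its 4x4 neighbours."""
    h, w = image.shape
    iy, ix = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for m in range(-1, 3):          # 4 rows
        for n in range(-1, 3):      # 4 columns -> 16 pixels in total
            row = min(max(iy + m, 0), h - 1)   # replicate border pixels
            col = min(max(ix + n, 0), w - 1)
            weight = cubic_weight(y - (iy + m)) * cubic_weight(x - (ix + n))
            value += weight * image[row, col]
    return value

# Hypothetical 5x5 test image whose value is 5*row + col.
image = np.arange(25, dtype=float).reshape(5, 5)
on_grid = bicubic_sample(image, 2, 2)       # lands exactly on a pixel
between = bicubic_sample(image, 2.5, 2.5)   # lands between four pixels
print(on_grid, between)
```

With this kernel the 16 weights sum to one, so a sample taken exactly on a pixel returns that pixel's value and a linear ramp is reproduced between pixels.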

(c) of FIG. 7 shows the restoration result of the trained teacher network 100, and (d) of FIG. 7 shows the restoration result of the student network 200 trained from the teacher network 100 without adding parameters.

The student network 200 has 100 times fewer parameters than the teacher network 100, and its processing speed is about 10 times faster, so the student network 200 is capable of real-time processing.

FIG. 8 is a hardware configuration diagram of a computing device according to an embodiment.

Referring to FIG. 8, the image super-resolution apparatus 10 executes, on a computing device 300 operated by at least one processor, a program including instructions written to carry out the operations of the present invention.

The hardware of the computing device 300 may include at least one processor 310, a memory 320, a storage 330, and a communication interface 340, which may be connected through a bus. Hardware such as an input device and an output device may also be included. The computing device 300 may be loaded with various software, including an operating system capable of running the program.

The processor 310 is a device that controls the operation of the computing device 300 and may be any of various types of processor that process the instructions included in a program, for example, a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), or a GPU (Graphic Processing Unit). The memory 320 loads the program so that the instructions written to carry out the operations of the present invention are processed by the processor 310. The memory 320 may be, for example, a ROM (read only memory) or a RAM (random access memory). The storage 330 stores the various data and programs required to carry out the operations of the present invention. The communication interface 340 may be a wired/wireless communication module.

The embodiments of the present invention described above are not implemented only through the apparatus and method; they may also be implemented through a program that realizes functions corresponding to the configuration of the embodiments, or through a recording medium on which that program is recorded.

Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

Claims

A method of operating a computing device operated by at least one processor, the method comprising:
training a teacher model using learning data including a low-resolution image and a high-resolution image corresponding to the low-resolution image, and initially training a student model with the learning data;
assigning weights to feature values of high importance among the feature values generated in the training process of the teacher model, and further training the student model using the important feature values; and
inputting an arbitrary image to the student model, and outputting an image in which the resolution of the arbitrary image is increased,
wherein the student model is a deep learning model equal in size to or smaller than the teacher model.

The operating method of claim 1,
wherein the teacher model includes
at least one residual block including a plurality of convolutional layers for extracting features of the learning data, an activation function that passes on the results of the convolutional layers, and a multi layer that scales the results of the activation function.

The operating method of claim 2,
wherein the further training of the student model comprises
modifying a loss function of the student model using the output values of each residual block of the teacher model.

The operating method of claim 3,
wherein the further training of the student model comprises
assigning weights to those output values, among the output values of the respective residual blocks, that the teacher model has determined to be important information for generating the high-resolution image from the low-resolution image, and determining the weighted output values of the residual blocks as the important feature values.

A computing device comprising:
a memory; and
at least one processor executing instructions of a program loaded into the memory,
wherein the program includes instructions written to execute the steps of:
initially training a student model using learning data including a low-resolution image and a high-resolution image corresponding to the low-resolution image;
extracting, from a teacher model whose training with the learning data has been completed, output values generated in the training process of the teacher model, and re-training the student model using the output values; and
inputting an arbitrary low-resolution image to the student model and outputting a high-resolution image of the arbitrary image,
wherein the student model is a deep learning model equal in size to or smaller than the teacher model.

The computing device of claim 5,
wherein the re-training step comprises
assigning weights to important output values, which are those output values that the teacher model has determined to be important information for generating the high-resolution image from the low-resolution image, and modifying a loss function of the student model using the weighted important output values.

The computing device of claim 6,
wherein the teacher model and the student model each include
at least one block including a plurality of convolutional layers for extracting features of the learning data and an activation function that passes on the results of the convolutional layers, and
wherein the re-training step comprises
extracting output values from each block included in the teacher model.

The computing device of claim 6,
wherein the re-training step
uses the loss function used for the initial training together with a loss function based on the difference between the important output values and the output values produced during the initial training of the student model.