KR102064581B1

KR102064581B1 - Apparatus and Method for Interpolating Image Autoregressive

Info

Publication number: KR102064581B1
Application number: KR1020180041121A
Authority: KR
Inventors: 전광길
Original assignee: 인천대학교 산학협력단
Priority date: 2018-04-09
Filing date: 2018-04-09
Publication date: 2020-01-10
Also published as: KR20190118023A

Abstract

The autoregressive image interpolation apparatus and method interpolate missing pixels in two stages within a GPU-parallel optimized section: diagonal interpolation for the dark points, followed by vertical interpolation for the gray points. This yields good spatial detail and edge sharpness. Because the pixels handled by diagonal or vertical interpolation in the autoregressive model are independent of one another, the method can be parallelized with CUDA.

Description

Apparatus and Method for Interpolating Image Autoregressive

The present invention relates to an autoregressive image interpolation method and, in particular, to an autoregressive image interpolation apparatus and method that interpolate missing pixels in two stages within a GPU-parallel optimized section, using diagonal interpolation for dark points and vertical interpolation for gray points, to achieve good spatial detail and edge sharpness.

High-definition (HD) products are becoming increasingly popular because of their high video quality.

Ultra HD resolution, however, requires effective algorithms with low time complexity for real-time reconstruction.

Interpolation is an important image-processing technique and a basic requirement for Ultra HD products, which use low-resolution (LR) images to generate high-resolution (HR) images.

Interpolation is widely used for image magnification, deinterlacing, and noise removal. Conventional bilinear interpolation, cubic convolution interpolation, and cubic spline interpolation are widely used in real-time applications because of their relatively low complexity.

However, conventional interpolation methods can produce artifacts such as blurring, ringing, jaggies, and zippering. To address this, algorithms that outperform the conventional methods have recently been proposed to meet the demand for higher display resolution.

One proposed method estimates the covariance of the HR image from the covariance of the LR image and interpolates the missing pixels based on the estimated covariance. Among the most promising recent algorithms is the autoregressive-modeling image interpolation proposed by Zhang and Wu, which uses a piecewise 2-D autoregressive (PAR) model to recover the HR image block by block.

Autoregressive-modeling image interpolation has also been applied to image compression, with promising results.

However, all of these autoregressive models are computationally expensive at their core, which makes it difficult to meet real-time reconstruction requirements in practical products.

Korean Patent Laid-Open Publication No. 10-2017-0012019 ("Title of invention: Operation method in a computing environment supporting a plurality of CPUs and a plurality of GPUs")

To solve this problem, an object of the present invention is to provide an autoregressive image interpolation apparatus and method that interpolate missing pixels in two stages within a GPU-parallel optimized section, using diagonal interpolation for dark points and vertical interpolation for gray points, to achieve good spatial detail and edge sharpness.

To achieve the above object, an autoregressive image interpolation method according to a feature of the present invention includes:

obtaining a low-resolution image by downsampling a high-resolution image of input image data;

generating an 8-connected neighboring pixel matrix and a 4-connected neighboring pixel matrix adjacent to and centered on a reference pixel to be interpolated in the obtained low-resolution image;

constructing, using the generated 8-connected neighboring pixel matrix and the generated 4-connected neighboring pixel matrix, one or more first matrices having first direction coefficients indicating a specific direction;

performing first image interpolation on the reference pixel to be interpolated based on the plurality of first matrices; and

improving the sharpness of an output image by reflecting, in the input image data, the reference pixel on which the first image interpolation has been performed.

An autoregressive image interpolation method according to a feature of the present invention includes:

a first step of obtaining a low-resolution image by downsampling a high-resolution image of input image data;

a second step of generating an 8-connected first neighboring pixel matrix and a 4-connected first neighboring pixel matrix located in the diagonal directions adjacent to and centered on a first reference pixel to be interpolated in the obtained low-resolution image;

a third step of constructing, using the generated 8-connected first neighboring pixel matrix and the generated 4-connected first neighboring pixel matrix, one or more first matrices having first direction coefficients indicating the diagonal direction;

a fourth step of performing first image interpolation on the first reference pixel to be interpolated based on the plurality of first matrices;

a fifth step of improving the sharpness of an output image by reflecting, in the input image data, the first reference pixel on which the first image interpolation has been performed by diagonal interpolation;

a sixth step of generating an 8-connected second neighboring pixel matrix and a 4-connected second neighboring pixel matrix located in the vertical directions adjacent to and centered on a second reference pixel to be interpolated in the obtained low-resolution image;

a seventh step of constructing, using the generated 8-connected second neighboring pixel matrix and the generated 4-connected second neighboring pixel matrix, one or more second matrices having second direction coefficients indicating the vertical direction;

an eighth step of performing second image interpolation on the second reference pixel to be interpolated based on the plurality of second matrices; and

a ninth step of improving the sharpness of the output image by reflecting, in the input image data, the second reference pixel on which the second image interpolation has been performed by vertical interpolation.

An autoregressive image interpolation apparatus according to a feature of the present invention includes:

an image input unit that obtains a low-resolution image by downsampling a high-resolution image of input image data;

a neighboring pixel matrix generator that generates an 8-connected neighboring pixel matrix and a 4-connected neighboring pixel matrix adjacent to and centered on a reference pixel to be interpolated in the obtained low-resolution image;

a direction coefficient matrix generator that constructs, using the generated 8-connected neighboring pixel matrix and the generated 4-connected neighboring pixel matrix, one or more first matrices having first direction coefficients indicating a specific direction;

an image interpolation calculator that performs first image interpolation on the reference pixel to be interpolated based on the plurality of first matrices; and

an image output unit that improves the sharpness of an output image by reflecting, in the input image data, the reference pixel on which the first image interpolation has been performed.

With the above configuration, the present invention interpolates missing pixels in two stages, diagonal interpolation for dark points and vertical interpolation for gray points, and thereby achieves good spatial detail and edge sharpness.

Because the pixels handled by diagonal or vertical interpolation in the autoregressive model are independent of one another, the present invention can be parallelized with CUDA.

FIG. 1 is a diagram illustrating an interpolation model according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the two interpolation stages based on the autoregressive image interpolation model according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the model for determining the parameters of the 8-connected and 4-connected neighboring pixels according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating an autoregressive image interpolation method according to an embodiment of the present invention.
FIG. 5 is a diagram schematically illustrating the configuration of an autoregressive image interpolation apparatus according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the correspondence between data and CUDA threads according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating a speed comparison across block sizes for the sample image data.
FIG. 8 is a diagram illustrating asynchronous data transfer using the Tesla K80 in the present invention.
FIG. 9 is a diagram illustrating a comparison between a single GPU and multiple GPUs.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

With the development of high-performance parallel computing, GPU-based acceleration is used in scientific computing for its low cost and efficient power usage.

CUDA is a general-purpose parallel computing platform and programming model that uses the parallel compute engine of NVIDIA GPUs to solve many complex computational problems. The pixels to be interpolated are independent of one another when diagonal or vertical interpolation is performed in the autoregressive model, which makes the method well suited to parallel optimization with CUDA.
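Because each pixel is independent, one thread can be assigned to one pixel. The following is a minimal sketch, in Python rather than CUDA, of the index arithmetic a CUDA kernel conventionally uses for that assignment. The function names and the 16 × 16 block size are illustrative assumptions and are not taken from the specification.

```python
def thread_to_pixel(block_idx, thread_idx, block_dim):
    """Map a 2-D CUDA-style (block, thread) index pair to a pixel (row, col)."""
    bx, by = block_idx
    tx, ty = thread_idx
    dim_x, dim_y = block_dim
    # Same arithmetic as blockIdx * blockDim + threadIdx in a CUDA kernel.
    return (by * dim_y + ty, bx * dim_x + tx)


def grid_size(width, height, block_dim):
    """Number of blocks needed to cover the image (ceiling division)."""
    dim_x, dim_y = block_dim
    return ((width + dim_x - 1) // dim_x, (height + dim_y - 1) // dim_y)


# Hypothetical 16 x 16 thread blocks covering a 256 x 256 image.
blocks = grid_size(256, 256, (16, 16))
pixel = thread_to_pixel((1, 2), (3, 4), (16, 16))
```

When the image size is not a multiple of the block size, the last blocks contain threads that fall outside the image, so a real kernel would need a bounds check before writing.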

One of the key aspects of autoregressive-modeling image interpolation is the use of an appropriate training window.

To capture the fine detail of the pixels to be interpolated, an octagonal training window can be used while reducing the required computation. Four pixels are estimated simultaneously in a larger sliding window, which is important in serial computing.

Given the limits on registers, shared memory, and other GPU resources, concurrency drops sharply if the load on each thread is too heavy. Computing a pixel to be interpolated requires a large number of matrix operations, including several matrix transformations, matrix multiplications, and matrix inversions.

Therefore, the present invention makes the training window more compact and achieves high speed on the GPU while maintaining high PSNR and subjective visual quality.

FIG. 1 is a diagram illustrating an interpolation model according to an embodiment of the present invention.

(1) Autoregressive image interpolation model

The spatial detail and edge sharpness of images interpolated with the autoregressive interpolation model are better than those produced by conventional interpolation methods.

The present invention gives significantly better results in terms of PSNR and subjective visual quality.

I_h is the HR image to be estimated and I_l is the observed LR image, which is a version of the HR image downsampled by a factor of two.

The present invention models the image as in [Equation 1] below.

Here, W is a local window containing 4 × 4 unknown pixels, as shown in FIG. 2, and V_{i,j} is a random perturbation independent of spatial position and of the image signal. The remaining terms of [Equation 1] are the autoregressive coefficient for each indexed neighbor of pixel X(i,j), the pixels of the images, and the neighbors of the pixel at position i in the HR image.

The autoregressive image interpolation model can describe known pixels in terms of unknown pixels and, by geometric duality, unknown pixels in terms of known pixels.

The model parameters remain constant or nearly constant within a small region.

Structures such as edges and textures can be learned by fitting samples from the local window. The autoregressive image interpolation model consists of two parts: diagonal interpolation and vertical interpolation. Diagonal interpolation is used to compute the HR pixels whose neighboring pixels are known.

Pixels estimated by diagonal interpolation are treated as known pixels of the HR image during vertical interpolation. First, the LR image is obtained from the HR image by downsampling.
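As a minimal sketch of the downsampling step, assuming plain decimation by a factor of two with no pre-filtering (the specification does not state whether a filter is applied), the LR image keeps every second pixel of the HR image in each direction. The function name is illustrative.

```python
def downsample_by_two(hr):
    """Keep every second pixel in each direction (direct decimation)."""
    return [row[::2] for row in hr[::2]]


# A 4 x 4 toy HR image whose pixel value encodes its position.
hr_image = [[r * 4 + c for c in range(4)] for r in range(4)]
lr_image = downsample_by_two(hr_image)
```

Under this assumption a 512 × 512 HR image gives a 256 × 256 LR image, which matches the sizes quoted for the Lena test image.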

FIG. 2 is a diagram illustrating the two interpolation stages based on the autoregressive image interpolation model according to an embodiment of the present invention, and FIG. 3 is a diagram illustrating the model for determining the parameters of the 8-connected and 4-connected neighboring pixels according to an embodiment of the present invention.

In FIG. 2, the white points are LR image pixels, and the dark points and gray points are missing HR pixels that are interpolated in two stages.

The dark points are interpolated first by diagonal interpolation, and the gray points are then interpolated by vertical interpolation.
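The two-stage order can be sketched by classifying every position on the HR grid. This sketch assumes that the LR samples sit at even row and even column positions, so that a dark point has LR pixels on its four diagonals and a gray point lies between known pixels horizontally or vertically. That layout is an assumption about FIG. 2, which is not reproduced in the text.

```python
def pixel_class(row, col):
    """Classify an HR grid position under the assumed even/odd layout."""
    if row % 2 == 0 and col % 2 == 0:
        return "known"       # white point: LR sample copied into the HR grid
    if row % 2 == 1 and col % 2 == 1:
        return "diagonal"    # dark point: interpolated in the first stage
    return "vertical"        # gray point: interpolated in the second stage


# Count each class on a 4 x 4 patch of the HR grid.
counts = {"known": 0, "diagonal": 0, "vertical": 0}
for r in range(4):
    for c in range(4):
        counts[pixel_class(r, c)] += 1
```

With a factor-of-two layout, a quarter of the HR pixels are known, a quarter are filled by the first stage, and the remaining half by the second stage.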

Vertical interpolation is performed in a manner similar to diagonal interpolation, so only diagonal interpolation is described below.

FIG. 2 shows the training window of the autoregressive image interpolation method of the present invention.

FIG. 2 shows the spatial configuration of SAI (Soft-Decision Adaptive Interpolation). The white points are the existing LR pixels and the dark points are the missing HR pixels. The figure also marks the direction coefficients of the 8-connected neighbors, the direction coefficients of the 4-connected neighbors, and the training window W.

The white points are LR pixels, denoted x, and the dark points, denoted y, are the missing HR pixels interpolated by vertical interpolation.

The interpolation is formulated as a least-squares problem, given in [Equation 2] below, where the parameter λ is the Lagrangian factor.

As shown in FIG. 3, the parameters can be estimated by the least-squares method, following the prior art, as in [Equation 3] and [Equation 4] below. This approach rests on the geometric-duality assumption that the image is piecewise stationary, so that the LR pixel covariance corresponds to the HR pixel covariance.
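A least-squares fit of four coefficients can be sketched as follows. This is an illustration of ordinary least squares solved through the normal equations, not a reproduction of [Equation 3] or [Equation 4], which are not available in this text. The function names, the toy data, and the use of Gaussian elimination are all assumptions.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_coefficients(neighbors, targets):
    """Minimise sum_i (targets[i] - neighbors[i] . a)^2 via the normal equations."""
    k = len(neighbors[0])
    ata = [[sum(row[p] * row[q] for row in neighbors) for q in range(k)]
           for p in range(k)]
    atx = [sum(row[p] * t for row, t in zip(neighbors, targets))
           for p in range(k)]
    return solve_linear(ata, atx)


# Toy data: each row holds the four neighbors of one known pixel.
rows = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
        [1, 2, 3, 4], [4, 3, 2, 1]]
true_coeffs = [0.4, 0.1, 0.3, 0.2]
values = [sum(a * v for a, v in zip(true_coeffs, row)) for row in rows]
estimated = fit_coefficients(rows, values)
```

The sketch assumes the neighbor matrix has full column rank. In flat image regions it may not, which is one reason a practical implementation would guard the solve or fall back to a simpler interpolator.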

As shown in FIG. 3, the model for determining the parameters of the 8-connected neighbors and the 4-connected neighbors can be estimated from already known pixels.

Here, the data vector contains the M² pixels inside the local window. A is an M² × 4 matrix whose i-th row vector contains the four 8-connected neighbors of x_i, and B is an M² × 4 matrix whose i-th row vector contains the four 4-connected neighbors of x_i, with t = {1, 2, 3, 4} and M = 4.

Finally, [Equation 5] and [Equation 6] below are obtained.

[Equation 2] above can then be rewritten, by substitution, as [Equation 7].

Here, the vector of unknowns consists of the nine unknown pixels in the current window, as shown in FIG. 2, and each of its entries is an estimated pixel within one block. This is a linear least-squares optimization problem.

The optimal estimate of this vector is defined in [Equation 8] below.

Here, the sizes of C and D are 14 × 9 and 14 × 14, respectively.

Matrix operations are time-consuming, so the intermediate steps must be reduced. Because matrix D is sparse, the expression is rewritten in explicit form and defined accordingly.

Matrix C is decomposed into an identity matrix I and a matrix H, where I₉ is the 9 × 9 identity matrix.

[Equation 8] is redefined as [Equation 10] below.

The parameter λ in [Equation 12] below is the Lagrangian factor from [Equation 2] above.

The amount of computation required by [Equation 10] is smaller than that required by [Equation 8].

The sample images used in the experiments are Lena, Peppers, Pentagon, and Airport, as listed in [Table 1].

[Table 2] compares PSNR and CPU execution time. The execution time of the proposed algorithm is compared with that of SAI (Soft-Decision Adaptive Interpolation). All simulations were run on an Intel Core i7-920 CPU at 2.66 GHz.

[Table 2] shows that the single-threaded CPU execution time of the autoregressive image interpolation method of the present invention is slightly lower than that of SAI. This is because the matrices are smaller, which greatly reduces the required computation.

In SAI, the sizes of C and D are 21 × 12 and 21 × 21, which are larger than the matrices in [Equation 10] and [Equation 11].
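The size difference can be made concrete by counting matrix elements. The shapes below are the ones quoted in the text. Treating the element count as a stand-in for cost is a simplifying assumption, since the real cost depends on the multiplications and inversions performed and on sparsity.

```python
def element_count(shape):
    """Number of entries in a dense matrix of the given (rows, cols) shape."""
    rows, cols = shape
    return rows * cols


proposed = {"C": (14, 9), "D": (14, 14)}
sai = {"C": (21, 12), "D": (21, 21)}

proposed_total = sum(element_count(s) for s in proposed.values())  # 126 + 196
sai_total = sum(element_count(s) for s in sai.values())            # 252 + 441
ratio = proposed_total / sai_total
```

By this rough measure the proposed window stores 322 entries against 693 for SAI, a little under half.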

When prepared for GPU parallel computing, the autoregressive image interpolation method of the present invention obtains one pixel per sliding window, which means more data is duplicated between adjacent training windows. In SAI, by contrast, four pixels are estimated simultaneously in a larger sliding window, which reduces data duplication.

Therefore, the next section optimizes the autoregressive interpolation model on the GPU. The present invention uses a number of CUDA optimization strategies to make full use of the GPU, each of which helps to address the time-consumption problem. PSNR and execution time are reported for each optimization.

(2) GPU-based optimization for the autoregressive image interpolation model

A. Experimental setup

The present invention performs GPU-based optimization of the autoregressive image interpolation model on an NVIDIA Tesla K80. The Tesla K80 contains two GPUs, with 2 × 2496 CUDA cores and a total of 26 SMX units.
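The per-GPU breakdown follows from the quoted figures by simple arithmetic, sketched below. The variable names are illustrative, and the sketch assumes the 26 SMX units are split evenly between the two GPUs.

```python
gpus = 2
cores_per_gpu = 2496
total_smx = 26

total_cores = gpus * cores_per_gpu            # CUDA cores on the whole board
smx_per_gpu = total_smx // gpus               # assumes an even split
cores_per_smx = cores_per_gpu // smx_per_gpu  # cores in one SMX unit
```

This gives 4992 cores in total, 13 SMX units per GPU, and 192 cores per SMX unit, which agrees with the Kepler SMX design.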

Compared with earlier GPU architectures, the Tesla K80 offers nearly twice as many registers and nearly twice as much shared memory. GPUs have the advantages of low cost and efficient power usage. In the experiments, the serial C code was run on a CPU (Intel Core i7-920, 2.66 GHz) with O2 compiler optimization.

When a CUDA program runs on the Tesla K80, the default total shared memory per block is 49,152 bytes and the total number of registers available per block is 65,536. The device specifications are listed in [Table 3] below.

B. CUDA-based algorithm flow

CUDA C offers a straightforward path for users familiar with the C programming language, making it easy to write programs that run on the device. In a CUDA program, the GPU handles the parallel portions, which are called kernels.

Parallel techniques are now used across science and engineering, and GPU-based image-processing models are widespread. Parallel computing shows great potential in image processing and is particularly important for time-consuming methods. CUDA code consists of three steps: transferring data to GPU memory, executing the CUDA kernel, and transferring the results from the GPU back to CPU memory.

FIG. 4 is a diagram illustrating an autoregressive image interpolation method according to an embodiment of the present invention, and FIG. 5 is a diagram schematically illustrating the configuration of an autoregressive image interpolation apparatus according to an embodiment of the present invention.

In some applications the results do not need to be transferred back to the CPU and are instead displayed directly from the GPU. Diagonal interpolation and vertical interpolation follow the same procedure.

1) Transfer the input image data to the GPU and mirror-replicate the edges.

2) Generate the 4-connected and 8-connected neighboring pixel matrices in the sliding training window used to train the direction coefficients.

3) Construct the matrices H and G from the direction coefficients of the previous step.

4) Compute the pixel to be interpolated using [Equation 10] above.

5) Transfer the results from the GPU to CPU memory.
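The mirror replication in step 1 can be sketched as below. The sketch assumes reflection about the edge pixel without repeating it. The specification does not say which mirroring variant is used, and the function names and padding width are illustrative.

```python
def mirror_index(i, n):
    """Reflect an out-of-range index into [0, n) without repeating the edge."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i = abs(i) % period
    return i if i < n else period - i


def mirror_pad(image, pad):
    """Return the image surrounded by `pad` mirrored pixels on every side."""
    height, width = len(image), len(image[0])
    return [[image[mirror_index(r, height)][mirror_index(c, width)]
             for c in range(-pad, width + pad)]
            for r in range(-pad, height + pad)]


padded = mirror_pad([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]], 1)
```

Padding lets every training window, including those centered on border pixels, read a full set of neighbors without special cases inside the kernel.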

In the optimizations that follow, the test images are Lena, Peppers, Pentagon, and Airport, which are among the most classic images in image processing.

As shown in [Table 1], the Lena HR image is 512 × 512, and the LR image downsampled from it is 256 × 256.

The autoregressive image interpolation method of the present invention has two stages, and all processing is performed in parallel on the GPU. When diagonal interpolation finishes, the image data is not transferred to the CPU but is prepared for vertical interpolation. Mirror replication of the edges is required before each interpolation stage.

Vertical interpolation and diagonal interpolation follow the same procedure, shown in FIG. 4.

The autoregressive image interpolation apparatus 100 according to an embodiment of the present invention includes an image input unit 110, a neighboring pixel matrix generator 120, a direction coefficient matrix generator 130, an image interpolation calculator 140, and an image output unit 150.

The image input unit 110 obtains a low-resolution image by downsampling a high-resolution image of the input image data (S100).

The neighboring pixel matrix generator 120 generates an 8-connected neighboring pixel matrix (Equations 3 and 5) and a 4-connected neighboring pixel matrix (Equations 4 and 6) adjacent to and centered on the reference pixel to be interpolated in the obtained low-resolution image (S102, S104).

The direction coefficient matrix generator 130 determines whether the local variance estimated from the LR pixels exceeds a threshold (S108). If it does, the pixel is interpolated by autoregression; if the local variance is below the threshold, the pixel is interpolated by bicubic interpolation (S110).

Using bicubic interpolation where the local variance is below the threshold reduces the amount of computation without sacrificing effectiveness. A simple method such as bicubic interpolation is sufficient for smooth regions.
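The variance test in S108 and S110 can be sketched as follows. The threshold value, the choice of neighborhood, and the function names are hypothetical, since the specification gives none of them. The sketch also sends the equal-to-threshold case to bicubic interpolation, which the text leaves unspecified.

```python
def local_variance(pixels):
    """Population variance of a list of pixel values."""
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def choose_method(lr_neighbors, threshold):
    """Pick the interpolator for one pixel from its LR neighborhood."""
    if local_variance(lr_neighbors) > threshold:
        return "autoregressive"   # edges and texture: full model
    return "bicubic"              # smooth region: cheaper method suffices


flat_choice = choose_method([10, 10, 10, 10], threshold=5.0)
edge_choice = choose_method([0, 255, 0, 255], threshold=5.0)
```

On a GPU this test makes threads in the same warp take different paths, so its benefit would depend on how smooth and textured regions are distributed across the image.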

The direction coefficient matrix generator 130 constructs, using the generated 8-connected neighboring pixel matrix and the generated 4-connected neighboring pixel matrix, one or more first matrices having first direction coefficients indicating a specific direction (Equations 11 and 12) (S112).

The image interpolation calculator 140 performs first image interpolation on the reference pixel to be interpolated based on the plurality of first matrices (Equation 10) (S114).

The image output unit 150 improves the sharpness of the output image by reflecting, in the input image data, the reference pixel on which the first image interpolation has been performed (S116).

The image autoregressive interpolation method (diagonal interpolation) includes: a first step of obtaining a low-resolution image by downsampling the high-resolution image of the input image data; a second step of generating, centered on a first reference pixel to be interpolated in the obtained low-resolution image, an 8-connected first neighbor pixel matrix (Equations 3 and 5) and a 4-connected first neighbor pixel matrix (Equations 4 and 6) located in the adjacent diagonal directions; a third step of using the generated 8-connected and 4-connected first neighbor pixel matrices to construct one or more first matrices having first direction coefficients that represent the diagonal direction (Equations 11 and 12); a fourth step of performing first image interpolation on the first reference pixel based on the plurality of first matrices (Equation 10); and a fifth step of reflecting the first reference pixel obtained by the diagonal interpolation in the input image data to improve the sharpness of the output image.

Next, the image autoregressive interpolation method (vertical interpolation) includes: a sixth step of generating, centered on a second reference pixel to be interpolated in the obtained low-resolution image, an 8-connected second neighbor pixel matrix (Equations 3 and 5) and a 4-connected second neighbor pixel matrix (Equations 4 and 6) located in the adjacent vertical directions; a seventh step of using the generated 8-connected and 4-connected second neighbor pixel matrices to construct one or more second matrices having second direction coefficients that represent the vertical direction (Equations 11 and 12); an eighth step of performing second image interpolation on the second reference pixel based on the plurality of second matrices (Equation 10); and a ninth step of reflecting the second reference pixel obtained by the vertical interpolation in the input image data to improve the sharpness of the output image.

When missing pixels of the input image data are interpolated, dark points are interpolated by the diagonal interpolation of the first through fifth steps, and gray points are interpolated by the vertical interpolation of the sixth through ninth steps.
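
The two-pass order can be sketched as follows. This is a minimal sketch under a stated assumption: a 2x enlargement in which the known LR pixels occupy the even (row, column) positions of the HR grid, dark points occupy the odd-odd positions, and gray points occupy the remaining positions. That layout and the function names are illustrative assumptions, not details confirmed by the passage above.

```python
def classify_hr_pixel(row, col):
    """Label an HR grid position under the assumed 2x layout."""
    if row % 2 == 0 and col % 2 == 0:
        return "known"   # copied directly from the LR image
    if row % 2 == 1 and col % 2 == 1:
        return "dark"    # pass 1: diagonal interpolation
    return "gray"        # pass 2: vertical interpolation


def interpolation_passes(height, width):
    """Return the dark positions (pass 1) and the gray positions (pass 2)."""
    dark, gray = [], []
    for row in range(height):
        for col in range(width):
            label = classify_hr_pixel(row, col)
            if label == "dark":
                dark.append((row, col))
            elif label == "gray":
                gray.append((row, col))
    return dark, gray


dark_points, gray_points = interpolation_passes(4, 4)
print(len(dark_points), len(gray_points))  # prints "4 8"
```

Within one pass no position depends on another position of the same pass, so each list can be distributed across GPU threads; the gray pass runs second so that it can treat the dark points as known pixels.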

FIG. 6 shows the correspondence between data and CUDA threads according to an embodiment of the present invention, and FIG. 7 shows a comparison of speedups across block sizes for the sample image data.

C. CUDA-based Optimization

1) Optimization using shared memory and registers

GPU memory plays an important role in CUDA optimization. The latency of reading and writing data in global memory is the bottleneck of some large CUDA programs.

Compared with global memory, shared memory and registers offer higher bandwidth and lower access latency. For faster access, the original LR image data, downsampled from the HR image, is first moved into shared memory.

The correspondence between data and CUDA threads is shown in FIG. 6.

As shown in FIG. 6, the present invention copies the data of each block into shared memory. The image data of each block is first copied into __shared__ Image[13][13].

With this data, 64 pixels are predicted when the block size is 8 × 8. To avoid visible block boundaries, data is duplicated between neighboring blocks.

Balancing the use of shared memory and registers is a classic optimization technique for CUDA programs.

All threads in the same block can communicate with each other, so every thread in a block can share the same LR data. However, the limited size of shared memory and bank conflicts must be taken into account.

The maximum amount of shared memory per block is 49,152 bytes. Shared memory is organized in banks: n read or write requests whose addresses fall in n distinct memory banks can be serviced simultaneously, so it is important to avoid bank conflicts as far as possible.
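
The effect of bank conflicts can be illustrated with a small model. This is a sketch under assumptions that the text does not state: 32 banks, each one 4-byte word wide, with consecutive words mapped to consecutive banks. The actual bank width on a given device may differ, and the function name `conflict_degree` is hypothetical.

```python
from collections import defaultdict


def conflict_degree(word_indices, num_banks=32):
    """Return the worst-case number of distinct words requested from one bank.

    A result of 1 means the accesses are conflict-free; a result of k means
    the hardware would need k serialized passes in this simplified model.
    Requests for the same word are counted once (broadcast).
    """
    words_per_bank = defaultdict(set)
    for index in word_indices:
        words_per_bank[index % num_banks].add(index)
    return max(len(words) for words in words_per_bank.values())


# 32 threads each read one word, separated by a fixed stride.
stride_13 = [thread * 13 for thread in range(32)]  # odd stride
stride_16 = [thread * 16 for thread in range(32)]  # power-of-two stride

print(conflict_degree(stride_13))  # prints 1
print(conflict_degree(stride_16))  # prints 16
```

In this model an odd row width such as 13 spreads strided accesses over all banks, whereas a stride of 16 funnels them into two banks.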

Register memory is private to each thread. Because registers have no bank conflicts, they are generally faster than shared memory. On the Tesla K80, the L1 cache and shared memory share 128 KB of configurable on-chip memory in each SMX.

All SMXs share an L2 cache of up to 1.5 MB. In the present invention, the number of registers available per block is 65,536, and each register is 32 bits wide.

In the next optimization step, operations on small matrices are completed in registers, as shown in FIG. 6. The parallelization strategy is to use one thread to process one sliding window. Because the L1 cache can absorb register spills, the following setting is used.

The present invention uses the "cudaFuncCachePreferL1" setting to favor the L1 cache, which means that the L1 cache is set to 48 KB in each SMX. The correspondence between data and threads is shown in FIG. 6. Tuning the use of registers and shared memory is a conventional optimization technique for CUDA programs.

With this optimization, the present invention achieved a significant speedup of 79.3 times for Lena.

Each block has 8 × 8 threads and must process 8 × 8 unknown pixels.

A 6 × 6 sliding window must be built for each unknown pixel, so the image data of one block is a 13 × 13 region that overlaps the neighboring blocks at its borders, as shown in FIG. 6.
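
The 13 × 13 figure follows from the block and window sizes, as the short calculation below shows. It is a sketch that assumes the unknown pixels of a block are spaced one LR pixel apart and that each pixel is stored as a 4-byte value; both are assumptions for illustration, as are the function names.

```python
def tile_side(block_side, window_side):
    """Side length of the LR region needed by a block of sliding windows."""
    # The first window needs window_side pixels; every further unknown pixel
    # shifts the window by one and therefore adds one more pixel.
    return block_side + window_side - 1


def tile_bytes(block_side, window_side, bytes_per_pixel=4):
    """Shared-memory footprint of one tile (bytes_per_pixel is assumed)."""
    side = tile_side(block_side, window_side)
    return side * side * bytes_per_pixel


def tile_overlap(block_side, window_side):
    """Number of rows or columns shared by two adjacent tiles."""
    return tile_side(block_side, window_side) - block_side


print(tile_side(8, 6))     # prints 13
print(tile_bytes(8, 6))    # prints 676
print(tile_overlap(8, 6))  # prints 5
```

Under these assumptions one tile occupies 676 bytes, far below the 49,152-byte shared memory limit per block, and adjacent tiles share five rows or columns.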

The method involves matrices of various sizes, including 4 × 4, 5 × 9, and 1 × 14. Because the matrix sizes in the fine-grained model of the image autoregressive interpolation method are not fixed, fine-grained thread partitioning is very difficult; it is left for future work.

[Table 4] shows the time comparison against single-threaded C code on the CPU after shared memory and registers are applied with an 8 × 8 block size.

2) Most suitable block size

The shared memory and register resources of the Tesla K80 are known to be scarce and expensive. The number of threads per block must therefore be chosen sensibly: if the block size is too small, the resources may not be fully utilized.

However, if the block size is too large, the degree of parallelism drops. In general, CUDA programs perform well when the number of threads per block is a multiple of 16.

However, the optimal block size is not always the same and is closely tied to the available GPU resources. To find the best performance, numerous speedup experiments were carried out with different block sizes. As shown in FIG. 7, a block size of 128 achieved the best performance for the method of the present invention, with speedups of 86.4 for Lena, 83.9 for Peppers, 87.6 for Pentagon, and 87.2 for Airport.
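
How a block size translates into a launch configuration can be sketched with a ceiling division. The example assumes one thread per unknown pixel and a hypothetical workload of 256 × 256 unknown pixels; neither figure comes from the text, and the function names are illustrative.

```python
def blocks_needed(num_unknown_pixels, threads_per_block):
    """Number of blocks so that every unknown pixel gets one thread."""
    # Ceiling division: the last block may be only partly filled.
    return -(-num_unknown_pixels // threads_per_block)


def is_multiple_of_16(threads_per_block):
    """Check the rule of thumb quoted in the description."""
    return threads_per_block % 16 == 0


pixels = 256 * 256  # hypothetical number of unknown pixels
for block_size in (64, 128, 256):
    print(block_size, is_multiple_of_16(block_size), blocks_needed(pixels, block_size))
```

The calculation only shows how the grid shrinks as blocks grow; which size is fastest still has to be measured, as the experiments behind FIG. 7 did.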

FIG. 8 shows asynchronous data transfer on the Tesla K80 according to the present invention, and FIG. 9 shows a comparison between a single GPU and multiple GPUs.

3) Further optimization using asynchronous data transfer

The present invention must copy the LR image data from the host to the GPU device and then store the HR image data back in host memory, which causes a relatively large overhead when the image data volume is large.

By using asynchronous data transfer (ADT), the present invention can effectively hide the host-device transfer time in the program.

The Tesla K80 has two memory copy engines: a host-to-device engine (H2D engine) and a device-to-host engine (D2H engine).

This design makes it possible to manage ADT through streams, where a stream is a sequence of commands executed in order. The LR image data is divided into four parts and four streams are created accordingly, with each stream responsible for one part. Data transfer and computation can then proceed concurrently by using the two separate copy engines. As shown in FIG. 8, when the data volume is large but the required computation is not complex, the benefit of ADT is pronounced. In the present invention, four streams are created on the GPU.
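
Dividing the image among the streams can be sketched as a row partition. This is a minimal sketch that assumes the image is split into horizontal bands; the text does not say how the four parts are formed, and the function name `split_rows` and the 256-row image are hypothetical.

```python
def split_rows(num_rows, num_streams=4):
    """Split row indices into contiguous [start, stop) bands, one per stream."""
    base, extra = divmod(num_rows, num_streams)
    bands = []
    start = 0
    for stream in range(num_streams):
        # The first `extra` bands get one additional row when the division
        # is not exact, so band sizes differ by at most one row.
        size = base + (1 if stream < extra else 0)
        bands.append((start, start + size))
        start += size
    return bands


print(split_rows(256))  # prints [(0, 64), (64, 128), (128, 192), (192, 256)]
print(split_rows(10))   # prints [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each band would then be copied, processed, and copied back on its own stream; pixels near a band boundary additionally need data from the neighboring band or a mirrored edge.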

However, ADT is not suitable for every program. If the host-device transfer time is not large enough to cover the overhead of managing the streams, ADT brings no benefit. In the method of the present invention, the data transfer time is shorter than the kernel execution time, and new edges must be mirrored when streams are used. In this case the speedup improves only slightly, to 88.3 for Lena, as shown in [Table 5].
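
Why the gain is small when the kernel dominates can be seen in an idealized timing model. The sketch assumes a perfect three-stage pipeline (copy in, compute, copy out) with equal-sized parts and a fixed management cost per stream; the timings are hypothetical units and are not measurements from the patent.

```python
def serial_time(h2d, kernel, d2h):
    """Total time when copy-in, kernel, and copy-out run back to back."""
    return h2d + kernel + d2h


def pipelined_time(h2d, kernel, d2h, num_streams, overhead_per_stream=0.0):
    """Idealized time when the work is split over num_streams streams."""
    stages = [h2d / num_streams, kernel / num_streams, d2h / num_streams]
    fill = sum(stages)                            # first part passes all stages
    steady = (num_streams - 1) * max(stages)      # the rest follow the slowest stage
    return fill + steady + num_streams * overhead_per_stream


# Hypothetical timings in which the kernel dominates the transfers.
print(serial_time(2.0, 10.0, 2.0))                                   # prints 14.0
print(pipelined_time(2.0, 10.0, 2.0, num_streams=4))                 # prints 11.0
print(pipelined_time(2.0, 10.0, 2.0, 4, overhead_per_stream=1.0))    # prints 15.0
```

In this model only the transfer time can be hidden, so a kernel-bound workload gains little, and a per-stream cost of one unit already makes the streamed version slower than the serial one.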

4) Optimization using multiple GPUs

The Tesla K80 contains two GPUs. Multi-GPU processing is well suited to single-node desktop systems with several computing devices, so the present invention can make full use of the hardware resources that multi-GPU technology provides.

Because each GPU is driven by one stream, the image autoregressive interpolation method of the present invention creates two streams.

Each GPU processes half of the image data, and the two devices execute their commands in parallel. After this final optimization, the speedup rises substantially, reaching 147.3 times for the Lena image, as shown in [Table 6]. The PSNR is almost unchanged, decreasing by only 0.01 dB for Lena. FIG. 9 compares a single GPU (without ADT) with multiple GPUs; the multi-GPU configuration has a clear advantage over a single GPU.
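
For reference, PSNR can be computed as in the sketch below. It uses the standard definition and assumes 8-bit images with a peak value of 255; the flat pixel lists are hypothetical and the function name is illustrative.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


original = [52, 55, 61, 59, 79, 61, 76, 61]       # hypothetical HR pixels
reconstructed = [52, 56, 61, 58, 79, 62, 76, 61]  # hypothetical interpolated pixels

print(round(psnr(original, reconstructed), 2))
```

A change of 0.01 dB corresponds to a very small change in mean squared error, which is consistent with the statement that the GPU optimizations leave image quality essentially unchanged.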

The embodiments of the present invention described above are not implemented only through an apparatus and/or a method; they may also be implemented through a program that realizes the functions corresponding to the configuration of the embodiments, a recording medium on which that program is recorded, and the like. Such implementations can be readily carried out by those skilled in the art to which the present invention pertains, based on the description of the embodiments above.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention, as defined in the following claims, also fall within the scope of the present invention.

100: image autoregressive interpolation apparatus
110: image input unit
120: neighbor pixel matrix generation unit
130: direction coefficient matrix generation unit
140: image interpolation calculation unit
150: image output unit

Claims

Obtaining a low-resolution image by downsampling a high-resolution image of input image data;
Generating an 8-connected neighbor pixel matrix and a 4-connected neighbor pixel matrix centered on a reference pixel to be interpolated in the obtained low-resolution image;
Constructing, using the generated 8-connected neighbor pixel matrix and the generated 4-connected neighbor pixel matrix, one or more first matrices having first direction coefficients indicating a specific direction;
Performing first image interpolation on the reference pixel to be interpolated based on the plurality of first matrices; and
Improving the sharpness of an output image by reflecting the reference pixel subjected to the first image interpolation in the input image data,
wherein the performing of the image interpolation
comprises performing image interpolation on the reference pixel by Equation 1 and Equation 2 below.
[Equation 1]

Where I ₉ is a 9 × 9 identity matrix.
[Equation 2]

here,

, Parameter λ is the Lagrangian Factor.

A first step of obtaining a low-resolution image by downsampling a high-resolution image of input image data;
A second step of generating an 8-connected first neighbor pixel matrix and a 4-connected first neighbor pixel matrix located in the diagonal directions adjacent to a first reference pixel to be interpolated in the obtained low-resolution image;
A third step of constructing, using the generated 8-connected first neighbor pixel matrix and the generated 4-connected first neighbor pixel matrix, one or more first matrices having first direction coefficients indicating the diagonal direction;
A fourth step of performing first image interpolation on the first reference pixel to be interpolated based on the plurality of first matrices;
A fifth step of improving the sharpness of an output image by reflecting the first reference pixel, on which the first image interpolation has been performed by the diagonal interpolation, in the input image data;
A sixth step of generating an 8-connected second neighbor pixel matrix and a 4-connected second neighbor pixel matrix located in the vertical directions adjacent to a second reference pixel to be interpolated in the obtained low-resolution image;
A seventh step of constructing, using the generated 8-connected second neighbor pixel matrix and the generated 4-connected second neighbor pixel matrix, one or more second matrices having second direction coefficients indicating the vertical direction;
An eighth step of performing second image interpolation on the second reference pixel to be interpolated based on the plurality of second matrices; and
A ninth step of improving the sharpness of the output image by reflecting the second reference pixel, on which the second image interpolation has been performed by the vertical interpolation, in the input image data,
wherein the performing of the image interpolation
comprises performing image interpolation on the reference pixel by Equation 3 and Equation 4 below.
[Equation 3]

Where I ₉ is a 9 × 9 identity matrix.
[Equation 4]

here,

, Parameter λ is the Lagrangian Factor.

The method according to claim 1 or 2,
wherein the generating of the 8-connected neighbor pixel matrix
further comprises generating the 8-connected neighbor pixel matrix by Equation 5 below.
[Equation 5]

here,

Is a data vector containing M ² pixels inside the local window, and A is the four 8-connected neighbors of xi

And M ² × 4 matrix of the i ^th row vector, t = {1,2,3,4} and Im M = 4, including.

The method according to claim 1 or 2,
wherein the generating of the 4-connected neighbor pixel matrix
further comprises generating the 4-connected neighbor pixel matrix by Equation 6 below.
[Equation 6]

here,

Is a data vector containing M ² pixels inside the local window, and B is the four 4-connected neighbor of xi

(Deleted)

The method of claim 2,
wherein the performing of the second image interpolation
comprises treating the first reference pixel, on which the first image interpolation has been performed by the diagonal interpolation, as a known pixel and using it when performing the second image interpolation.

The method of claim 2,
wherein, when pixels missing from the input image data are interpolated, the diagonal interpolation of the first step, the second step, the third step, the fourth step, and the fifth step is performed for dark points, and
the vertical interpolation of the sixth step, the seventh step, the eighth step, and the ninth step is performed for gray points.

An image input unit that obtains a low-resolution image by downsampling a high-resolution image of input image data;
A neighbor pixel matrix generation unit that generates an 8-connected neighbor pixel matrix and a 4-connected neighbor pixel matrix centered on a reference pixel to be interpolated in the obtained low-resolution image;
A direction coefficient matrix generation unit that constructs, using the generated 8-connected neighbor pixel matrix and the generated 4-connected neighbor pixel matrix, one or more first matrices having first direction coefficients indicating a specific direction;
An image interpolation calculation unit that performs first image interpolation on the reference pixel to be interpolated based on the plurality of first matrices; and
An image output unit that improves the sharpness of an output image by reflecting the reference pixel subjected to the first image interpolation in the input image data,
wherein the image interpolation on the reference pixel is performed using Equation 1 and Equation 2 below.
[Equation 1]

Where I ₉ is a 9 × 9 identity matrix.
[Equation 2]

here,

, Parameter λ is the Lagrangian Factor.

The apparatus of claim 8,
wherein the 8-connected neighbor pixel matrix is generated by Equation 3 below.
[Equation 3]

here,

The apparatus of claim 8,
wherein the 4-connected neighbor pixel matrix is generated by Equation 4 below.
[Equation 4]

here,

(Deleted)