KR102467673B1

KR102467673B1 - Deep Iterative Frame Interpolation Based Video Stabilization Method

Info

Publication number: KR102467673B1
Application number: KR1020200090250A
Authority: KR
Inventors: 권인소; 최진수
Original assignee: 한국과학기술원
Priority date: 2020-07-21
Filing date: 2020-07-21
Publication date: 2022-11-17
Anticipated expiration: 2040-07-21
Also published as: KR20220011398A

Abstract

A video stabilization method based on deep iterative frame interpolation is disclosed. The method stabilizes video frames by a step of interpolating neighboring first and second original frames to synthesize an intermediate frame of the first and second original frames, and a step of performing that synthesis sequentially over consecutive original video frames. To prevent blur from accumulating, a third original frame corresponding to the intermediate frame may be used at every iteration of frame interpolation to correct fine detail errors in the intermediate frame and thereby generate a final intermediate frame. The method does not distort the video through the loss of original content caused by cropping, and it requires no labeling work to prepare curated training data. It can also prevent the accumulation of image blur caused by iterative interpolation.

Description

Deep Iterative Frame Interpolation Based Video Stabilization Method

The present invention relates to the field of video stabilization, and more particularly to a method that uses frame interpolation to stabilize a video without distorting it.

An ever-increasing amount of video is being uploaded to the web. As video grows rapidly, its share of Internet traffic is projected to reach about 82% of the total by 2022. High-quality video is accordingly becoming more important and more in demand. Visual stability is an essential element of high-quality video, particularly for handheld and shaky footage.

Video is becoming ever more important and more widely used. Camera shake during shooting can make the recorded video jitter in arbitrary directions, and defocus or fast motion can blur the image. Video stabilization is the technology that converts such unstable video into clean video free of shake and blur. Visual stabilization is a basic and essential technology for high-quality video in many video-related industries. It is a desirable feature for commercial platforms such as YouTube and for editing software such as Adobe Premiere. Indeed, leading companies including YouTube, Adobe, Netflix, Vimeo, and Samsung have commercialized video stabilization and related technologies and use them in many ways.

Most existing video stabilization methods stabilize shaky video through optimization or supervised learning. In most cases the video is stabilized offline as post-processing. Offline methods have generally shown better stabilization results than online methods, and among them deep learning-based approaches have shown promising quality. Deep learning-based methods take a supervised approach that requires paired unstable (shaky) and stable (motion-smoothed) video data. They therefore use datasets collected with an unstable camera and a stable camera that capture the same scene simultaneously.

However, most existing video stabilization methods, including deep learning methods, crop the edge regions (frame boundaries) as post-processing of the stabilized video, because stabilization leaves the edges of the video empty, that is, it produces temporarily missing views. Camera shake causes content to be temporarily missing at the frame boundaries when a frame is compared with its adjacent frames (f_i, f_{i+1}). As illustrated in FIG. 1(b), most existing methods therefore cut away, in the final processing step, the region outside the frame boundary (20) marked by the red dotted line.

Cutting away the region outside the frame boundary (20) causes a corresponding loss of original content and an unavoidable zoom-in effect. Because the cropped boundary region is part of the original video, cropping causes an unwanted loss of information and distorts the video. Less area remains than in the original video, so the result appears zoomed in. Stabilization quality also suffers, and warping distortion becomes visible. To obtain high-quality video from stabilization, excessive cropping needs to be reduced. In addition, because existing video stabilization techniques follow a supervised learning approach, they require labeled, curated data for training.

(1) Liu, S., Yuan, L., Tan, P., and Sun, J. Bundled camera paths for video stabilization. ACM Transactions on Graphics (TOG) 32, 4 (2013), 78. (Bundled program)
(2) Grundmann, M., Kwatra, V., and Essa, I. Auto-directed video stabilization with robust L1 optimal camera paths. In IEEE Computer Vision and Pattern Recognition (CVPR). 225-232. (2011) (Adobe Premiere Pro CC 2017, Robust L1 program)
(3) Liu, S., Tan, P., Yuan, L., Sun, J., and Zeng, B. MeshFlow: Minimum latency online video stabilization. In European Conference on Computer Vision (ECCV). 800-815. (2016) (MeshFlow program)
(4) Wang, M., Yang, G. Y., Lin, J. K., Ariel Shamir, Song-Hai Zhang, Shao-Ping Lu, and Shi-Min Hu. Deep Online Video Stabilization with Multi-Grid Warping Transformation Learning. IEEE Transactions on Image Processing (TIP) (2018). (StabNet program)
(5) Liu, S., Yuan, L., Tan, P., and Sun, J. SteadyFlow: Spatially smooth optical flow for video stabilization. In IEEE Computer Vision and Pattern Recognition (CVPR). 4209-4216. (2014)
(6) Liu, S., Li, M., Zhu, S., and Zeng, B. CodingFlow: enable video coding for video stabilization. IEEE Transactions on Image Processing (TIP) 26, 7 (2017), 3291-3302.

One object of the present invention is to provide a video stabilization method, based on unsupervised deep learning and iterative interpolation of video frames, that does not distort the video through the loss of original content caused by cropping and that requires no labeling work to prepare curated data.

Another object of the present invention is to provide a video stabilization method that strengthens stabilization by applying frame interpolation sequentially and deeply iteratively to consecutive original video frames, while preventing the accumulation of image blur caused by iterative interpolation.

The problems to be solved by the present invention are not limited to those described above and may be extended in various ways without departing from the spirit and scope of the present invention.

A video stabilization method according to embodiments for achieving the above objects is performed by executing a computer program on a computer device, and is characterized in that it stabilizes video frames by including: a step of interpolating neighboring first and second original frames to synthesize an intermediate frame of the first and second original frames; and a step of performing the 'step of synthesizing an intermediate frame of the first and second original frames' sequentially over consecutive original video frames.

In exemplary embodiments, the step of synthesizing the intermediate frame (f_int) may include: estimating a bidirectional optical flow between the first and second original frames; generating first and second warped frames by warping the first and second original frames toward each other, halfway to the midpoint between them, based on the estimated optical flow; and estimating and generating the intermediate frame of the first and second original frames from the first and second warped frames by means of a learned convolutional neural network.
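The halfway warping step can be illustrated on a one-dimensional row of pixels. The following is a minimal sketch, not the patented implementation: `sample` and `warp_halfway` are hypothetical helper names, the optical flow is assumed to be given rather than estimated by a network, and backward warping with linear interpolation is an assumed choice of warping scheme.

```python
import math


def sample(row, x):
    """Linearly interpolate `row` at real-valued position x (zero outside the row)."""
    x0 = math.floor(x)
    t = x - x0

    def at(k):
        # Explicit bounds check so negative indices never wrap around.
        return row[k] if 0 <= k < len(row) else 0.0

    return (1.0 - t) * at(x0) + t * at(x0 + 1)


def warp_halfway(row, flow, scale=0.5):
    """Backward-warp `row` by `scale * flow` (assumption: flow[x] is the displacement at x)."""
    return [sample(row, x - scale * flow[x]) for x in range(len(row))]


# A bright pixel sits at index 2 in the first frame and at index 6 in the second.
frame_a = [0, 0, 1, 0, 0, 0, 0, 0, 0]
frame_b = [0, 0, 0, 0, 0, 0, 1, 0, 0]
flow_a_to_b = [4.0] * len(frame_a)   # uniform motion of +4 pixels
flow_b_to_a = [-4.0] * len(frame_b)  # the reverse direction

warped_a = warp_halfway(frame_a, flow_a_to_b)
warped_b = warp_halfway(frame_b, flow_b_to_a)
print(warped_a)
print(warped_b)
```

With this uniform flow both warped rows place the bright pixel at index 4, the midpoint of its two original positions, which is the input the fusing network would then receive.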

In exemplary embodiments, the bidirectional optical flow may be estimated using a PWC-Net module, a convolutional neural network (CNN) model for optical flow estimation.

In exemplary embodiments, the intermediate frame may be estimated using a U-Net module, which is a convolutional neural network.

In exemplary embodiments, the U-Net module may learn how to combine the first and second warped frames at different scales, and may estimate an intermediate frame in which fine details are preserved by predicting local high-resolution image information from global low-resolution image information.

In exemplary embodiments, the video stabilization method may further include, in order to prevent blur from accumulating, generating a final intermediate frame by using a third original frame corresponding to the intermediate frame at every iteration of frame interpolation to correct fine detail errors contained in the intermediate frame.

In exemplary embodiments, generating the final intermediate frame may include: warping the third original frame, which lies between the first and second original frames, toward the intermediate frame, using the intermediate frame as a reference frame; feeding the warped third original frame together with the intermediate frame to a ResNet module; and, in the ResNet module, using the warped third original frame to correct fine detail errors contained in the intermediate frame and thereby generate the final interpolated (intermediate) frame.
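The data flow of the synthesis and refinement steps described above can be summarized as a skeleton. This is only a sketch under stated assumptions: `interpolate_frame` and its four callables are hypothetical names, `estimate_flow` stands in for the PWC-Net role, `fuse` for the U-Net role, and `refine` for the ResNet role. Estimating a flow from the third original frame to the intermediate frame before warping it is also an assumption, since the text only states that the frame is warped toward the intermediate frame.

```python
def interpolate_frame(f_prev, f_next, f_orig, estimate_flow, warp, fuse, refine):
    """Synthesize one refined intermediate frame from its neighbours (structural sketch)."""
    # 1. Bidirectional optical flow between the two neighbouring frames.
    flow_fwd = estimate_flow(f_prev, f_next)
    flow_bwd = estimate_flow(f_next, f_prev)
    # 2. Warp both frames halfway toward each other.
    w_prev = warp(f_prev, flow_fwd, 0.5)
    w_next = warp(f_next, flow_bwd, 0.5)
    # 3. Fuse the two warped frames into the intermediate frame f_int.
    f_int = fuse(w_prev, w_next)
    # 4. Warp the original frame fully toward f_int, then correct fine detail errors.
    w_orig = warp(f_orig, estimate_flow(f_orig, f_int), 1.0)
    return refine(w_orig, f_int)


# Toy stand-ins: each "frame" is a scalar camera position instead of an image.
result = interpolate_frame(
    0.0, 4.0, 3.0,
    estimate_flow=lambda a, b: b - a,
    warp=lambda frame, flow, scale: frame + scale * flow,
    fuse=lambda a, b: 0.5 * (a + b),
    refine=lambda w_orig, f_int: 0.5 * (w_orig + f_int),
)
print(result)
```

Passing the components as callables keeps the skeleton independent of any particular network implementation. The scalar stand-ins only demonstrate the order of operations, not image quality.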

In exemplary embodiments, the video stabilization method may further include strengthening the visual stability of the video frames through deep iterative frame interpolation, in which the 'step of synthesizing the intermediate frame' and the 'step of performing sequentially' are repeated at least once on the consecutive video frames produced by the 'step of generating the final interpolated frame'.

In exemplary embodiments, the video stabilization method may further include adjusting frame stability by introducing a skip parameter, which controls which frames are used for interpolation, and setting the skip parameter to a desired value.
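This passage does not define the skip parameter precisely. As an illustration only, the sketch below assumes that a skip value s selects the frames s positions before and after the current frame, clamped at the ends of the video. The function name `neighbour_indices` and this exact indexing rule are assumptions.

```python
def neighbour_indices(i, num_frames, skip=1):
    """Return the indices of the two frames used to interpolate frame i.

    Assumption: skip=s picks frames i-s and i+s, clamped to the valid range.
    """
    if skip < 1:
        raise ValueError("skip must be a positive integer")
    left = max(i - skip, 0)
    right = min(i + skip, num_frames - 1)
    return left, right


print(neighbour_indices(5, 10, skip=1))
print(neighbour_indices(5, 10, skip=2))
```

Under this assumption a larger skip value interpolates across a wider temporal window, which would smooth the camera path more strongly.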

In exemplary embodiments, the video stabilization method may further include training a frame interpolation model by defining a pseudo-ground-truth frame (f_s), which is a spatially transformed version of the third original frame, and reconstructing the pseudo-ground-truth frame (f_s) by warping the first and second original frames, which precede and follow the third original frame, toward the pseudo-ground-truth frame.

In exemplary embodiments, the spatial transformation of the third original frame may be performed in a random direction and at a random scale, and the random scale may be a value within 1/8 of the frame width.
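One way to picture the pseudo-ground-truth frame is as a randomly shifted copy of the original frame. The sketch below assumes the spatial transformation is a pure translation; `random_translation` and `translate` are hypothetical helpers, and zero padding of the uncovered border is an arbitrary choice made for the example.

```python
import math
import random


def random_translation(frame_width, rng):
    """Draw a shift with a random direction and a magnitude of at most frame_width / 8."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    magnitude = rng.uniform(0.0, frame_width / 8.0)
    # Round each component to whole pixels.
    return round(magnitude * math.cos(angle)), round(magnitude * math.sin(angle))


def translate(frame, dx, dy, fill=0):
    """Shift a 2-D frame (list of rows) by (dx, dy) pixels, padding with `fill`."""
    height, width = len(frame), len(frame[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                out[y][x] = frame[src_y][src_x]
    return out


rng = random.Random(0)
frame = [[x + 10 * y for x in range(16)] for y in range(8)]
dx, dy = random_translation(len(frame[0]), rng)
pseudo_ground_truth = translate(frame, dx, dy)
print(dx, dy)
```

Because the target is derived from the original frame itself, no separately captured stable video is needed, which is what makes the training unsupervised.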

To achieve the above objects, a computer-executable program stored on a computer-readable recording medium may be provided for performing the video stabilization method according to the exemplary embodiments listed above.

To achieve the above objects, a computer-readable recording medium may be provided on which a computer program for performing the video stabilization method according to the exemplary embodiments listed above is recorded.

According to exemplary embodiments of the present invention, video stabilization can be achieved through unsupervised learning by using deep learning and frame interpolation. Generating an interpolated frame between neighboring frames reduces inter-frame jitter.

The stabilization effect can be strengthened further by reusing the generated interpolated frames and applying frame interpolation in a deep iterative manner. Such Deep Iterative FRame INTerpolation (DIFRINT) can provide the following effects. First, it runs in near real time (15 fps) and enables full-frame video stabilization without cropping the frame edges. Second, it can be trained end to end in an unsupervised manner, so no corresponding ground-truth frames are needed. Third, by incorporating frame interpolation, it produces stabilized video frames with little visual distortion.

FIG. 1(a) shows an unstable input video, and FIG. 1(b) illustrates the stabilization result of a cropping-based method, in which frames are spatially adjusted and then cropped (indicated by the red dotted line).
FIG. 2(a) illustrates a frame interpolation-based video stabilization method according to an exemplary embodiment of the present invention, and FIG. 2(b) illustrates the full-frame video obtained by the stabilization processing.
FIG. 3 illustrates an actual interpolated video frame (f_mid) generated by interpolating two adjacent video frames (f_i and f_{i+1}) according to an exemplary embodiment of the present invention.
FIGS. 4(A) and 4(B) show the DIFRINT framework during training and testing, respectively.
FIG. 5 contrasts an image obtained by frame interpolation only (red box) with an image obtained by frame interpolation plus blur-accumulation prevention (blue box), according to an exemplary embodiment of the present invention.
FIG. 6 is a diagram explaining that iterative frame interpolation can further strengthen visual stability.
FIG. 7 shows stabilized sequences of frame segments when frame interpolation is iterated once (top) and five times (bottom).
FIG. 8 illustrates that an image produced by deep iterative frame interpolation according to an exemplary embodiment of the present invention shows even regions that are not visible in the input image.
FIG. 9 shows the architecture of the DIFRINT framework according to an exemplary embodiment of the present invention, including the U-Net and ResNet architectures and the feedforward process.
FIG. 10 shows a performance comparison between video stabilization using the DIFRINT framework according to an exemplary embodiment of the present invention and video stabilization by other known prior-art techniques.
FIG. 11 shows, for qualitative evaluation, video stabilization results of the method according to the present invention and of recent known video stabilization programs (StabNet, Bundled, SteadyFlow, CodingFlow).
FIG. 12 contrasts stabilized images obtained with the video stabilization method according to the present invention and with the existing video stabilization program StabNet.
FIG. 13 contrasts performance test results of the video stabilization method according to the present invention and existing video stabilization programs on six video categories.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. The same reference numerals are used for the same components throughout the drawings, and duplicate descriptions of the same components are omitted.

For the embodiments of the present invention disclosed herein, specific structural and functional descriptions are given only for the purpose of describing those embodiments. Embodiments of the present invention may be implemented in various forms and should not be construed as limited to the embodiments described herein. That is, because the present invention can be modified in various ways and can take many forms, specific embodiments are illustrated in the drawings and described in detail in the text. This is not intended to limit the present invention to a particular disclosed form, however, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope.

The terms used herein are used only to describe particular embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof. Terms such as first and second may be used to describe various components, but the components are not limited by those terms. The terms are used only to distinguish one component from another.

FIG. 2(a) conceptually illustrates a method of generating stabilized video frames using frame interpolation according to an exemplary embodiment of the present invention, and FIG. 2(b) illustrates the full-frame video obtained by such stabilization processing.

FIG. 2 illustrates how frame interpolation can achieve video stabilization. The video stabilization method according to the exemplary embodiment uses frame interpolation as its means of stabilization. As illustrated in FIG. 2, consecutive frames (x-axis) (f_i, where i is a natural number) may be unstable because of spatial jitter (y-axis, expressed as a one-dimensional displacement) (see (A)). To remove this instability, the frame interpolation-based method generates an interpolated frame from two neighboring original frames, and it does so sequentially over the consecutive frames. As illustrated in (E) and (F) of FIG. 2, the method generates an interpolated frame (f'_i) between two neighboring original frames (f_{i-1}, f_{i+1}). Here, the first original frame (f_{i-1}) temporally precedes the second original frame (f_{i+1}). In this specification, two 'neighboring' original frames means both the case in which the two frames are immediately adjacent and the case in which one or a small number of other frames lie between them. Next, an interpolated frame (f'_{i+1}) is generated between that interpolated frame (f'_i) and the original frame (f_{i+2}). Interpolation is performed sequentially on the remaining original frames in the same way. FIG. 2(B) illustrates the relatively stabilized video frames obtained by performing one stage of frame interpolation on all the original frames in this manner. The frame interpolation of the exemplary embodiment thus stabilizes the frames while also interpolating the frame boundaries, which is an essential added benefit. The same frame interpolation can then be applied again to the stabilized frames obtained from the first stage. (C) and (D) of FIG. 2 illustrate the video frames obtained after a total of two and three stages of frame interpolation, respectively. The higher the interpolation stage, that is, going from (A) to (D) in FIG. 2, the more stable the video becomes.
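The staged smoothing described in the paragraph above can be imitated with a toy one-dimensional model. This sketch makes strong simplifying assumptions: each frame is reduced to a single scalar displacement, the learned interpolation network is replaced by a plain midpoint average, and the first and last frames are left unchanged. The names `interpolate_pass`, `stabilize`, and `jitter` are hypothetical.

```python
def interpolate_pass(positions):
    """One stage: replace each interior frame by the midpoint of its two neighbours.

    The left neighbour is the frame already produced in this stage and the right
    neighbour comes from the input of this stage, as in the description above.
    """
    out = list(positions)
    for i in range(1, len(positions) - 1):
        out[i] = 0.5 * (out[i - 1] + positions[i + 1])
    return out


def stabilize(positions, stages):
    """Apply `stages` successive interpolation stages."""
    for _ in range(stages):
        positions = interpolate_pass(positions)
    return positions


def jitter(positions):
    """Mean absolute second difference, a crude measure of frame-to-frame shake."""
    n = len(positions)
    total = sum(
        abs(positions[i - 1] - 2.0 * positions[i] + positions[i + 1])
        for i in range(1, n - 1)
    )
    return total / (n - 2)


shaky = [0.0, 4.0, -3.0, 5.0, -2.0, 3.0, -4.0, 2.0, 0.0]
for stages in range(4):
    print(stages, round(jitter(stabilize(shaky, stages)), 3))
```

On this sample sequence the jitter measure decreases with each additional stage, mirroring the progression from (A) to (D) in FIG. 2. Real frames are images, so this scalar model says nothing about blur or boundary synthesis.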

FIG. 3 illustrates an actual interpolated video frame (f_mid) generated by interpolating two neighboring original video frames (f_i and f_{i+1}) according to an exemplary embodiment of the present invention.

Referring to FIG. 3, interpolation may be performed for every pixel between the two adjacent frames (f_i and f_{i+1}). For example, the red flag is interpolated to the position (green arrow) midway between its positions in the two frames (red and blue arrows). In addition, the frame boundary region is obtained by interpolating between a region that is not visible (30-1, in the preceding frame f_i) and a region that is visible (30-3, in the following frame f_{i+1}) because of inter-frame camera motion. That is, the edge (30-2) of the interpolated video frame (f_mid) can be reconstructed by looking at both of its neighboring frames (f_i and f_{i+1}).

Exemplary embodiments of the present invention provide a deep framework that can be trained in an unsupervised manner and therefore requires no stable ground-truth data. Because the method stabilizes frames by frame interpolation, cropping of the frame boundaries can be eliminated. In essence, the deep framework learns how to generate a middle frame "in between" two sequential frames. From the viewpoint of interpolation, the synthesized middle (that is, interpolated) frame represents a frame that could have been captured between the two sequential frames. The interpolated frame therefore represents the temporally intermediate frame and can be assumed to have been captured at exactly half of the inter-frame motion. As a result, sequential generation of middle frames reduces the spatial jitter between adjacent frames, as shown in FIG. 2. Intuitively, frame interpolation can be regarded as linear interpolation (a low-pass filter) in the time domain applied to a sequence of spatial data.

Applying linear interpolation several times strengthens the stabilization effect. For a sequence of spatial data, that is, a frame sequence, frame interpolation essentially estimates the exact midpoint of every pixel between two adjacent frames (f_i, f_{i+1}) and generates the middle frame (f_mid), as illustrated in FIG. 3. An important advantage of middle-frame synthesis is that the frame boundaries are synthesized between the inter-frame camera motions, which fills in temporarily missing views and yields full-frame output. The frame interpolation-based method also has the advantage that, owing to its deep architecture, it can run in near real time (15 fps) through fast feedforward inference.
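The low-pass intuition can be made concrete with a simplified scalar model. This is an idealization introduced here for illustration, not part of the disclosed method: let p_i denote the one-dimensional position of a scene point in frame i, and let one interpolation stage replace it by the midpoint of two adjacent frames.

```latex
% Assumed scalar model: p_i is the position of a scene point in frame i.
\[
  p^{(1)}_i = \tfrac{1}{2}\bigl(p_i + p_{i+1}\bigr)
  \quad\Longrightarrow\quad
  H(\omega) = \tfrac{1}{2}\bigl(1 + e^{j\omega}\bigr)
            = e^{j\omega/2}\cos\!\left(\tfrac{\omega}{2}\right)
\]
\[
  \bigl|H(\omega)\bigr|^{n} = \left|\cos\!\left(\tfrac{\omega}{2}\right)\right|^{n},
  \qquad 0 \le \omega \le \pi
\]
```

Here H is the frequency response of the two-tap averaging filter and n is the number of stages. Because the magnitude is below 1 for every nonzero temporal frequency, each additional stage attenuates jitter further while leaving the constant component of the camera path unchanged.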

The core idea of the frame-interpolation-based video stabilization method according to the exemplary embodiment is to obtain visual stability by sequentially synthesizing the middle frame between two frames of a given video. Frame interpolation is performed on successive frames using their adjacent frames. Throughout this specification, 'interpolated frame' and 'intermediate frame' are synonyms and are used interchangeably.

In an exemplary embodiment, frame interpolation may be applied iteratively over several steps to produce a stronger stabilization effect. To this end, a Deep Iterative FRame INTerpolation (DIFRINT) architecture may be used.

FIGS. 4(A) and 4(B) show the DIFRINT framework during training and testing, respectively. In FIG. 4, the differences between training and testing are marked in red. Fixed components are drawn with solid lines and trainable components with dotted lines. The applied losses are shown in green.

FIG. 4(A) is the model used for training and FIG. 4(B) is the model used for testing. During training, a warping 'reference' called the pseudo-ground-truth frame (f_s) is given (described in detail below). During testing, however, a 'middle' image must be generated, so to warp the two given images to their midpoint, the amount of warping applied to each is multiplied by the constant 0.5.

As shown in FIG. 4(B), two neighboring original frames (f_i-1, f_i+1) may be applied to a first-stage warping unit (60). The first and second warpers (60-1, 60-2) of the first-stage warping unit (60) estimate the bidirectional optical flow between the two original frames (f_i-1, f_i+1) and, based on the estimated flow, warp the two frames toward each other to the halfway position between them (denoted 'x 0.5'). In this specification, 'halfway' means the spatial midpoint, which in theory also coincides with the temporal midpoint. The bidirectional optical flow may be estimated using PWC-Net, a CNN model known for effective optical flow estimation, which is built on pyramidal processing, warping, and a cost volume. More specifically, the first warper (60-1) warps the first frame (f_i-1) toward the second frame (f_i+1) to the midpoint of the distance to the second frame, and the second warper (60-2) warps the second frame (f_i+1) toward the first frame (f_i-1) to the midpoint of the distance to the first frame. As a result, the first-stage warping unit (60) generates two warped frames (f_w-, f_w+), which represent the midpoints derived from the two original frames (f_i-1, f_i+1), respectively.
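The 'x 0.5' halfway warping can be sketched in one dimension. The following is a minimal illustration under stated assumptions, not the PWC-Net based warper of the embodiment: `warp_1d` is a hypothetical helper that performs backward warping with linear sampling, and the flow values are hand-picked rather than estimated. Here `flow[x]` is assumed to be the offset from output position x to the position that is sampled in the source signal.

```python
def warp_1d(signal, flow, scale=0.5):
    """Backward-warp a 1-D signal.

    out[x] samples `signal` at x + scale * flow[x], using linear
    interpolation and clamping the sample position to the border.
    """
    n = len(signal)
    out = []
    for x in range(n):
        src = min(max(x + scale * flow[x], 0.0), n - 1.0)
        lo = int(src)                 # src >= 0, so int() is floor
        hi = min(lo + 1, n - 1)
        t = src - lo
        out.append((1.0 - t) * signal[lo] + t * signal[hi])
    return out


# A bump sits at index 2 in the previous frame and at index 6 in the next one.
f_prev = [0, 0, 1, 0, 0, 0, 0, 0, 0]
f_next = [0, 0, 0, 0, 0, 0, 1, 0, 0]

# Hand-picked flows: the full displacement between the frames is 4 samples.
f_w_minus = warp_1d(f_prev, [-4] * 9, scale=0.5)  # bump moves right by 2
f_w_plus = warp_1d(f_next, [4] * 9, scale=0.5)    # bump moves left by 2

print(f_w_minus)
print(f_w_plus)
```

Both warped signals place the bump at index 4, the midpoint of the two original positions, which is the role the two warped frames (f_w-, f_w+) play before they are merged.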

The warped frames (f_w-, f_w+) may then be fed to a convolutional neural network (CNN) module (65), which may be, for example, a U-Net module. The U-Net module (65) may generate the intermediate frame (f_int) from the two warped frames (f_w-, f_w+). Note that the two warped frames may contain holes or unseen regions, which is why both are used: they complement each other. A CNN module such as the U-Net module (65) essentially learns how to combine the information of the two warped frames at different scales, so that global low-resolution image information guides the prediction of local high-resolution image information. In this way, the U-Net module (65) can estimate an intermediate frame (f_int) in which fine image detail is preserved.

FIG. 5 contrasts an image obtained by frame interpolation alone (the interpolation-only method, marked with a red box) with an image obtained by frame interpolation combined with the blur-accumulation prevention processing (marked with a blue box), according to an exemplary embodiment of the present invention.

Referring to FIG. 5, when video frames are processed by repeated frame interpolation alone, blur tends to accumulate over the iterations, as in the enlarged patch segment marked by the red box. This is because repeated interpolation loses information about the fine details of the image. In an exemplary embodiment, to prevent blur accumulation, the original frame corresponding to the intermediate frame may be used to correct fine-detail errors in the intermediate frame and thereby generate the final intermediate frame.

In an exemplary embodiment, to prevent such blur accumulation, the intermediate frame (f_int) and its corresponding original frame (f_i) may be supplied to a second-stage warping unit (70). The second-stage warping unit (70) warps the original frame (f_i) toward the intermediate frame (f_int), using the intermediate frame as the reference frame. The frame warped by the second-stage warping unit (70) may be fed, together with the intermediate frame (f_int), to another CNN module, for example a ResNet architecture module (75). Using the original frame (f_i), which retains the fine details of the image, the ResNet architecture module (75) corrects the fine-detail errors in the intermediate frame (f_int) and generates the final intermediate (interpolated) frame ($\hat{f}_i$). Through its residual learning mechanism, the ResNet architecture module (75) is well suited to minimizing errors in fine details.

With this frame interpolation method, the original frame (f_i) can be used at every iteration of the interpolation, so the visual detail of the image is maintained, as in the enlarged patch segment marked by the blue box in FIG. 5. That is, even when frame interpolation is repeated several times, the original frame (f_i) still carries the fine-detail information, so resampling errors hardly occur.

A key advantage of the video stabilization method according to an exemplary embodiment of the present invention is that it uses deep learning and frame interpolation to solve video stabilization through unsupervised learning.

The goal of the exemplary unsupervised learning framework of the present invention is to generate intermediate frames without error accumulation. The goal of the training scheme is therefore to train the framework correctly so that it produces interpolation of that quality. To understand the training scheme, it is first necessary to understand the testing scheme.

As described above with reference to FIG. 4(B), actual frame interpolation according to the exemplary embodiment may be implemented by warping the two original frames (f_i-1, f_i+1) toward each other to the halfway point and then feeding the warped frames (f_w-, f_w+) to the U-Net architecture module (65) to generate the intermediate frame (f_int). To prevent errors or blur, the original frame (f_i) corresponding to the intermediate frame (f_int) is warped toward the intermediate frame, and the warped frame is fed together with the intermediate frame (f_int) to the ResNet architecture module (75) to generate the final intermediate (interpolated) frame ($\hat{f}_i$), in which fine-detail errors have been corrected.
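The data flow of the testing scheme can be summarized as a short sketch. Everything here is a hypothetical stand-in: `estimate_flow`, `warp`, `unet`, and `resnet` are caller-supplied callables with assumed signatures, not the PWC-Net, U-Net, or ResNet modules of the embodiment. The point is only the order of the three flow estimations and three warps.

```python
def interpolate_triplet(f_prev, f_cur, f_next, estimate_flow, warp, unet, resnet):
    """One pass of the test-time pipeline for a frame triplet.

    Assumed signatures (hypothetical):
      estimate_flow(src, dst) -> flow from src to dst
      warp(frame, flow, scale) -> frame moved along scale * flow
      unet(a, b) -> merged intermediate frame
      resnet(warped_original, f_int) -> refined final frame
    """
    # Stage 1: warp both neighbours halfway toward each other ('x 0.5').
    f_w_minus = warp(f_prev, estimate_flow(f_prev, f_next), 0.5)
    f_w_plus = warp(f_next, estimate_flow(f_next, f_prev), 0.5)
    f_int = unet(f_w_minus, f_w_plus)

    # Stage 2: warp the original frame fully onto f_int, then refine.
    f_cur_warped = warp(f_cur, estimate_flow(f_cur, f_int), 1.0)
    return resnet(f_cur_warped, f_int)


# Toy stand-ins where a "frame" is just a scalar position.
toy_flow = lambda src, dst: dst - src
toy_warp = lambda frame, flow, scale: frame + scale * flow
toy_unet = lambda a, b: (a + b) / 2
toy_resnet = lambda warped, f_int: (warped + f_int) / 2

result = interpolate_triplet(0.0, 7.0, 10.0, toy_flow, toy_warp, toy_unet, toy_resnet)
print(result)
```

With these scalar stand-ins the shaky middle position 7.0 is pulled to 5.0, the midpoint of its neighbors.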

Training this model, however, raises the following issue: a ground truth middle frame may not exist. Because there is no guarantee that the original frame (f_i) lies at the midpoint between its two neighboring original frames (f_i-1, f_i+1), the original frame (f_i) cannot be used as ground truth. Consequently, there is no ground truth against which the final intermediate (interpolated) frame ($\hat{f}_i$) can be compared to compute a loss.

Training must take this into account. FIG. 4(A) gives an overview of the training scheme. In an exemplary embodiment, as shown in FIG. 4(A), a pseudo-ground-truth frame (f_s) may be defined as a spatially transformed version of the original frame (f_i). The spatial transformation may be applied in a random direction and at a random small scale, for example up to 1/8 of the frame width. Training may then be performed with the objective of reconstructing the pseudo-ground-truth frame (f_s) by warping the two neighboring frames (f_i-1, f_i+1) toward it.
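A pseudo-ground-truth frame of this kind can be sketched as a random translation. This is a minimal sketch under assumptions: the frame is a list of rows, the transformation is an integer shift, the vertical shift is bounded by the same 1/8-of-width limit as the horizontal one, and uncovered pixels receive a fill value. The name `make_pseudo_ground_truth` and its parameters are hypothetical.

```python
import random


def make_pseudo_ground_truth(frame, max_fraction=1 / 8, rng=None, fill=0):
    """Shift `frame` by a random integer offset (dx, dy).

    Each offset magnitude is at most max_fraction * frame width
    (assumption: the same bound is used vertically).
    Returns the shifted frame and the offset that was applied.
    """
    rng = rng or random.Random()
    height, width = len(frame), len(frame[0])
    max_shift = int(width * max_fraction)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)

    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                shifted[y][x] = frame[src_y][src_x]
    return shifted, (dx, dy)


# Toy 8 x 16 frame whose pixel values encode their own position.
f_i = [[y * 16 + x for x in range(16)] for y in range(8)]
f_s, offset = make_pseudo_ground_truth(f_i, rng=random.Random(0), fill=-1)
print(offset)
```

Because f_s is derived from f_i by a known transformation, it gives the network a reconstruction target even though no true middle frame exists.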

To this end, the first original frame (f_i-1) and the pseudo-ground-truth frame (f_s) are supplied to the first warper (40-1) of the first-stage warping unit (40), and the second original frame (f_i+1) and the pseudo-ground-truth frame (f_s) are supplied to the second warper (40-2). Two warped frames (w_-, w_+) are generated by the first warper (40-1) and the second warper (40-2). When the two warped frames (w_-, w_+) are provided to the U-Net (45), a CNN module, the U-Net (45) can learn to reconstruct the pseudo-ground-truth frame (f_s) from them. The original frame (f_i), with its pixel positions shifted horizontally and vertically, may be used as the training target, that is, as the pseudo-ground-truth frame (f_s). The ResNet (55) is trained in the same way using the warped original frame: the original frame (f_i) and the intermediate frame (f_int) are supplied to the third warper (50), and the resulting warped frame is fed together with the intermediate frame (f_int) to the ResNet (55) to generate the final intermediate (interpolated) frame ($\hat{f}_i$). This training scheme generalizes to the testing scheme, which can be viewed as reconstructing a virtual middle frame.

Once the training scheme has properly trained the model by reconstructing the pseudo-ground-truth frame (f_s), actual frame interpolation can be applied in the testing scheme. The two adjacent frames (f_i-1, f_i+1) are warped toward each other by half (x 0.5), as shown in FIG. 4(B). Techniques that warp consecutive frames halfway toward each other, or that learn to predict the middle frame, have been used for frame interpolation tasks. In practice, given a video sequence, every consecutive frame triplet (f_i-1, f_i, f_i+1) can be used as input to the framework according to the exemplary embodiment to generate a stabilized output frame. The first and last frames, however, are kept as they are. An option may be provided to stabilize the generated video frames further by iteratively applying the video stabilization method of the present invention to frames that have already been interpolated.
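The triplet-by-triplet procedure with optional iterations can be sketched as a small driver loop. This is an illustrative sketch only: `stabilize` and the `interpolate(prev, original, nxt)` callable are hypothetical names, and it is an assumption here that each iteration takes the neighbors from the previous iteration's output while always passing the original frame f_i.

```python
def stabilize(frames, interpolate, iterations=1):
    """Apply triplet interpolation to every interior frame, `iterations` times.

    interpolate(prev, original, nxt) is a caller-supplied callable.
    The first and last frames are returned unchanged.
    """
    current = list(frames)
    for _ in range(iterations):
        updated = list(current)
        for i in range(1, len(frames) - 1):
            # Neighbours come from the previous pass; the centre frame
            # is always the original input frame.
            updated[i] = interpolate(current[i - 1], frames[i], current[i + 1])
        current = updated
    return current


# Toy usage: frames are scalar positions, interpolation is the midpoint.
midpoint = lambda prev, original, nxt: 0.5 * (prev + nxt)
video = [0.0, 3.0, 2.0, 3.0, 4.0]

one_pass = stabilize(video, midpoint, iterations=1)
two_pass = stabilize(video, midpoint, iterations=2)
print(one_pass, two_pass)
```

The number of iterations is the knob that trades residual motion for smoothness in this sketch.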

FIG. 6 is a diagram illustrating that iterative frame interpolation can further strengthen visual stability.

As shown in FIG. 6, the frame produced by two iterations of frame interpolation has a spatial orientation that is also influenced by input frames f_i-2 and f_i+2, because it is obtained by repeating the interpolation twice. By contrast, a single iteration of frame interpolation produces a frame that is influenced only by frames f_i-1 and f_i+1. In other words, the more iterations are performed, the farther from the current frame are the frames that influence the current interpolation (shown as blue connections). The stabilization therefore becomes stronger and more global than for frames interpolated with fewer iterations (shown as yellow connections).
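The growth of the temporal neighborhood with the number of iterations can be checked with a small bookkeeping sketch. It assumes, as an illustration, that each iteration builds frame i from the previous iteration's frames i-1 and i+1 plus the original frame i, and that the first and last frames are never modified. `influencing_inputs` is a hypothetical helper.

```python
def influencing_inputs(index, iterations, num_frames):
    """Sorted indices of the input frames that can affect output frame `index`."""
    current = [{i} for i in range(num_frames)]
    for _ in range(iterations):
        updated = [set(s) for s in current]  # end frames stay as they are
        for i in range(1, num_frames - 1):
            updated[i] = current[i - 1] | {i} | current[i + 1]
        current = updated
    return sorted(current[index])


print(influencing_inputs(5, 1, 11))
print(influencing_inputs(5, 2, 11))
```

One iteration reaches frames 4 to 6 and two iterations reach frames 3 to 7, which matches the f_i-2 and f_i+2 dependence described for FIG. 6.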

FIG. 7 shows stabilized sequences of frame segments obtained with one iteration of frame interpolation (top) and with five iterations (bottom).

Referring to FIG. 7, with a single iteration of frame interpolation, temporally local fluctuations remain between frames. With five iterations, by contrast, the local fluctuations are greatly attenuated and the frames are globally stabilized. The more iterations are performed, the stronger the global stabilization of the video. In addition, as a result of deep iterative frame interpolation, the method according to the present invention can generate unseen regions (i.e., missing views) at the image boundaries. FIG. 8 shows an image that confirms this effect. Comparing the regions in the red boxes shows that the image obtained by the deep iterative frame interpolation method ("Our result") depicts regions that are not visible in the input image ("INPUT").

Another parameter for adjusting stability may also be provided: a skip parameter that controls which frames are used for interpolation. For example, the default interpolation uses frames f_i-1 and f_i+1 as the adjacent frames (skip = 1), but setting the skip parameter to 2 uses frames f_i-2 and f_i+2 as input. Triplets for which the skipped adjacent frames do not exist (e.g., those near the first or last frame) may be assigned a smaller skip parameter.
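Neighbor selection with a skip parameter can be sketched as follows. The source only states that triplets near the ends receive a smaller skip parameter; the specific rule used here (shrink the skip until both neighbors are in range) is an assumption, and `neighbor_indices` is a hypothetical helper.

```python
def neighbor_indices(i, num_frames, skip=1):
    """Return (i - s, i + s), where s is `skip` reduced so both indices exist.

    Frames 0 and num_frames - 1 are never interpolated.
    """
    if not 0 < i < num_frames - 1:
        raise ValueError("the first and last frames are kept as they are")
    s = min(skip, i, num_frames - 1 - i)
    return i - s, i + s


print(neighbor_indices(5, 11, skip=1))  # default: adjacent frames
print(neighbor_indices(5, 11, skip=2))  # wider temporal support
print(neighbor_indices(1, 11, skip=2))  # near the start: skip shrinks to 1
```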

In an exemplary embodiment, pixel-wise color-based loss functions may be used to train the network components. Because the l2 loss has been reported to produce blurry results, the l1 loss function defined as follows is used.

$$L_1 = \lVert f_s - \hat{f}_i \rVert_1 \qquad (1)$$

Here, L_1 is the loss between the pseudo-ground-truth frame (f_s) and the output frame, i.e., the final intermediate frame ($\hat{f}_i$). A perceptual loss function that uses the response of the relu4_3 layer of VGG-19 may also be considered.

$$L_p = \lVert \phi(f_s) - \phi(\hat{f}_i) \rVert_2^2 \qquad (2)$$

Here, $\phi$ denotes the feature vector. The sum of L_1 and L_p may be taken as the final loss (L_out).

$$L_{out} = L_1 + L_p \qquad (3)$$

Although the final loss (L_out) is sufficient to train the whole network, applying the same loss to the intermediate frame (f_int) can speed up training and improve performance. The loss L_int, which uses only the output of the U-Net, can be obtained as follows.

$$L_{int} = \lVert f_s - f_{int} \rVert_1 + \lVert \phi(f_s) - \phi(f_{int}) \rVert_2^2 \qquad (4)$$

Because the intermediate frame (f_int) essentially also aims to reconstruct the pseudo-ground-truth frame (f_s), it is safe to apply this loss, which in effect trains the U-Net component explicitly.
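The way the loss terms combine can be sketched on flat lists of pixel values. This is a sketch under several assumptions, not the training code of the embodiment: `features` is a hypothetical stand-in for the VGG-19 relu4_3 response, the perceptual distance is taken as a squared l2 distance, and the output loss and the intermediate loss are simply added to form the training objective.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two flat pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_l2_distance(a, b):
    """Sum of squared differences (assumed form of the perceptual distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def combined_loss(prediction, target, features):
    """Pixel-wise l1 term plus a perceptual term computed on features(.)."""
    return (l1_distance(prediction, target)
            + squared_l2_distance(features(prediction), features(target)))


def training_loss(f_out, f_int, f_s, features):
    """Loss on the final output plus the same loss on the U-Net output.

    Summing the two terms is an assumption made for this sketch.
    """
    return combined_loss(f_out, f_s, features) + combined_loss(f_int, f_s, features)


# Toy feature extractor (hypothetical): differences of adjacent pixels.
toy_features = lambda v: [v[i + 1] - v[i] for i in range(len(v) - 1)]

f_s = [0, 1, 2]      # pseudo-ground-truth
f_int = [0, 1, 3]    # intermediate frame, slightly off
f_out = [0, 1, 2]    # refined output matches the target

print(training_loss(f_out, f_int, f_s, toy_features))
```

In this toy case the refined output contributes no loss, and the intermediate frame contributes 1 from the pixel term and 1 from the feature term.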

FIG. 9 shows the U-Net (165) and ResNet (175) architectures and the feedforward process of the DIFRINT framework (100) according to an exemplary embodiment of the present invention.

Referring to FIG. 9, the optical flow estimator (PWC-Net) estimates the optical flow between the two input frames and generates warped frames (f_w-, f_w+) in which the two input frames have each been moved halfway toward the other. Within the framework (100) according to the exemplary embodiment, the optical flow estimator (PWC-Net) is fixed and is not trained, whereas the U-Net (165) and ResNet (175) networks described below are trainable.

The adjacent warped frames (f_w-, f_w+) may be fed to the U-Net (165), which concatenates them to generate the intermediate frame (f_int). The original frame (f_i) may be warped toward the intermediate frame (f_int); the result is denoted $\tilde{f}_i$. As an example, the U-Net (165) architecture may employ three skip connections (dotted arrows) between scaled features. It may consist of 3 x 3 convolutional layers with hidden feature channels of size 32, with features downscaled and upscaled by a factor of 2.

The warped original frame ($\tilde{f}_i$) and the intermediate frame (f_int) may then be fed to the ResNet (175), which uses them to generate the final intermediate (interpolated) frame ($\hat{f}_i$). As an example, the ResNet (175) may have an architecture comprising five residual blocks. The residual blocks use 1 x 1 convolutional layers (channel size 32) to minimize noise from neighboring pixels while reconstructing the final output, the final intermediate (interpolated) frame ($\hat{f}_i$). The 1 x 1 kernels can also speed up the feedforward process.

In an exemplary embodiment, all convolutional layers of the trainable components may use gated convolutions (3 x 3), which have shown good quality in image inpainting tasks. Gated convolutions allow dynamic feature selection for each spatial (pixel) location and each channel. They are therefore well suited to this problem, in which the input images may contain holes or unseen regions after warping.
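The gating idea can be sketched in one dimension: a feature branch and a gate branch are computed from the same input, and the sigmoid of the gate scales the activated feature at each position. This is a toy sketch, not the 2-D layers of the embodiment; the choice of ReLU as the feature activation, the fixed kernels, and the helper names are assumptions.

```python
import math


def conv1d_same(signal, kernel):
    """Zero-padded 1-D correlation with an odd-length kernel."""
    radius = len(kernel) // 2
    n = len(signal)
    return [
        sum(kernel[k] * signal[x + k - radius]
            for k in range(len(kernel)) if 0 <= x + k - radius < n)
        for x in range(n)
    ]


def gated_conv1d(signal, feature_kernel, gate_kernel):
    """output[x] = relu(feature[x]) * sigmoid(gate[x])."""
    feature = conv1d_same(signal, feature_kernel)
    gate = conv1d_same(signal, gate_kernel)
    return [max(f, 0.0) * (1.0 / (1.0 + math.exp(-g)))
            for f, g in zip(feature, gate)]


signal = [1.0, 1.0, 1.0]
# A zero gate kernel gives sigmoid(0) = 0.5 everywhere.
half_open = gated_conv1d(signal, [0.0, 1.0, 0.0], [0.0, 0.0, 0.0])
# A strongly negative gate nearly closes every position.
closed = gated_conv1d(signal, [0.0, 1.0, 0.0], [0.0, -20.0, 0.0])
print(half_open, closed)
```

In a trained layer the gate kernel is learned, so positions that fall in holes left by warping can be suppressed while valid positions pass through.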

The framework according to the exemplary embodiment may be implemented, for example, in PyTorch. Training on two NVIDIA Titan X (Maxwell) GPUs takes about one day, and generating a 1280×720 frame takes 0.07 seconds, which corresponds to near-real-time operation at about 15 fps. The generation process includes all three optical flow estimations, three warping layers, and the feedforward passes through the U-Net and ResNet architectures. In practice, CUDA's multi-stream capability allows frames to be generated in parallel, as long as the two frames needed as input have already been generated in the previous iteration.

FIG. 10 shows the results of a quantitative evaluation comparing the video stabilization method using the DIFRINT framework according to an exemplary embodiment of the present invention ("Present Invention") with recent known commercial video stabilization programs (StabNet, Bundled, SteadyFlow, CodingFlow).

Video stabilization methods are generally evaluated on three metrics: cropping ratio, distortion value, and stability score. For each metric, a good result is one closer to 1.0. The cropping ratio measures the image area remaining after the missing-view (black) borders are cropped away; a larger cropping ratio means less cropping and therefore better video quality. The distortion value gives the anisotropic scaling of the homography between an input frame and the corresponding output frame. It can be computed as the ratio of the two largest eigenvalues of the affine part of the homography, and the worst ratio among the homographies of all frames is taken as the distortion value. The stability score measures the overall smoothness of the output video and is based on frequency-domain analysis. The camera path is used to compute it: the translation and rotation components are extracted to form two 1D profile signals, the ratio of the energy of the lowest (2nd to 6th) frequencies to the total energy is computed for each, and the final stability score is the minimum of these values.
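Two of these metrics can be sketched with the standard library. The sketch rests on assumptions that the text does not fix: the affine part is taken as the upper-left 2 x 2 block of the homography, eigenvalue magnitudes are compared (smaller over larger), the DC bin is excluded from the total energy, and only the first half of the spectrum is used. `distortion_value` and `stability_score` are hypothetical helper names.

```python
import cmath
import math


def distortion_value(a, b, c, d):
    """Ratio of eigenvalue magnitudes of the 2 x 2 block [[a, b], [c, d]]."""
    half_trace = (a + d) / 2.0
    root = cmath.sqrt(half_trace * half_trace - (a * d - b * c))
    small, large = sorted([abs(half_trace + root), abs(half_trace - root)])
    return small / large if large else 0.0


def stability_score(profile):
    """Energy in the 2nd to 6th frequency bins over the total non-DC energy."""
    n = len(profile)
    energy = []
    for k in range(1, n // 2 + 1):  # k = 0 (DC, the 1st bin) is skipped
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(profile))
        energy.append(abs(coeff) ** 2)
    total = sum(energy)
    return sum(energy[:5]) / total if total else 1.0


# Toy camera-path profiles (hypothetical data).
smooth_path = [math.sin(2 * math.pi * t / 64) for t in range(64)]
shaky_path = [(-1) ** t for t in range(64)]

print(distortion_value(1.0, 0.0, 0.0, 1.0))   # identity: no anisotropic scaling
print(distortion_value(2.0, 0.0, 0.0, 1.0))   # stretched along one axis
print(stability_score(smooth_path), stability_score(shaky_path))
# Final score for a video: the minimum over its translation and rotation profiles.
print(min(stability_score(smooth_path), stability_score(shaky_path)))
```

A slowly varying path scores close to 1, while a path that flips every frame scores close to 0, in line with the statement that good results lie nearer 1.0.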

Performance was evaluated on the three metrics (cropping ratio, distortion value, and stability score) using a total of 12 public videos. According to the results, the method according to the present invention showed the best performance on most of the 12 videos. In particular, the video stabilization method according to the present invention has a cropping ratio of 1.0, keeps the distortion value close to 1.0 for all videos, and achieves the highest stability score on most videos.

For qualitative evaluation, FIG. 11 shows the results of the video stabilization method according to the present invention alongside those of recent known commercial video stabilization programs (Bundled, Robust L1, MeshFlow) so that they can be compared visually.

FIG. 11 shows examples of stabilized frames from three videos. The green box in the input frames (far right) indicates the cropped region of the largest frame among the videos produced by the three known techniques. Compared with the input frames (far right), the videos stabilized by the three known programs all crop the input frames to some extent and show a zoom-in effect, whereas the video stabilized by the method of the present invention shows no such cropping or zoom-in effect. That is, the video obtained by the method of the present invention looks as if the camera had moved to stabilize the frames, generating the unseen regions instead of cropping them away. This generation of unseen regions can be seen in the image of FIG. 8 mentioned above. As a result of deep iterative frame interpolation, the video stabilization method of the present invention can generate content in image boundary regions that are not visible in the input frame.

FIG. 12 contrasts stabilized images obtained with the video stabilization method according to the present invention and with the existing video stabilization program StabNet.

Referring to FIG. 12, the yellow boxes (present invention) and red boxes (StabNet) show enlarged views of distortion artifacts. The green box indicates, within the frame of the present invention, the region to which StabNet crops. Because the video stabilization method of the present invention uses frame interpolation for stabilization, overall image distortion is considerably low. The video obtained with StabNet, by contrast, exhibits distortion.

Compared with other known commercial video stabilization tools, the DIFRINT framework according to an exemplary embodiment of the present invention showed better overall performance across six categories: Regular, Crowd, Parallax, Quick Rotation, Running, and Zooming. In particular, for challenging scenes with severe shake, such as the Running category, commercial algorithms crop large margins to compensate for successive missing views, whereas the DIFRINT framework retains the important content, as shown in FIG. 12. Existing commercial algorithms also exhibit some wobbling and distortion, especially in video segments with severe shake.

A minor advantage of the video stabilization method according to an exemplary embodiment of the present invention is that the user can adjust the number of frame interpolation iterations and the frame skip parameter as desired. Some users may wish to leave a certain amount of instability in certain types of video. Instead of a one-size-fits-all method applied to every video, the method gives the user freedom of control.
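The effect of the two user-adjustable parameters can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the specification: the names `stabilize` and `interpolate_midframe` are hypothetical, the skip parameter is interpreted as selecting frames `i - skip` and `i + skip` as the interpolation pair for frame `i`, frames too close to either end of the sequence are simply kept, and a plain pixel average stands in for the learned interpolation network (optical flow estimation, warping, and the refinement using the original frame are all omitted). It shows only the control flow of the iteration count and the skip value, not the stabilization quality of the actual method.

```python
import numpy as np


def interpolate_midframe(frame_a, frame_b):
    # Hypothetical stand-in for the learned interpolator: a plain pixel average.
    return 0.5 * (frame_a + frame_b)


def stabilize(frames, iterations=1, skip=1):
    # Assumed reading of the skip parameter: frame i is re-synthesized from
    # frames i - skip and i + skip of the previous pass.
    current = [np.asarray(f, dtype=np.float64) for f in frames]
    for _ in range(iterations):
        updated = list(current)
        # Frames closer than `skip` to either end have no pair and are kept.
        for i in range(skip, len(current) - skip):
            updated[i] = interpolate_midframe(current[i - skip], current[i + skip])
        current = updated
    return current
```

Calling `stabilize(frames, iterations=1, skip=1)` performs a single pass over adjacent neighbors; raising `iterations` or `skip` widens the temporal support that each output frame draws on, which is one way to picture why larger values would give stronger smoothing and smaller values would leave more of the original motion.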

The video stabilization method according to the present invention can stabilize a video without causing unwanted side effects. The stabilization processing is fast enough to allow near real-time computation. As an unsupervised deep learning approach to video stabilization through iterative frame interpolation, the method can produce stabilized full-frame video with little visual distortion. Because the method also makes use of the original frames, most of the frame content is preserved.

The video stabilization method according to the embodiments of the present invention described above may be implemented as a computer program executable by various computer means. The implemented computer program may be recorded on a computer-readable recording medium. The video stabilization method of the present invention can be performed when a processing unit of a computer device reads the computer program stored on the recording medium and executes it.

The computer-readable medium may contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

The various computer means for carrying out the present invention may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The present invention can be used in image processing fields such as the video content industry (YouTube, Vimeo), core technology for video editors (Adobe Premiere, After Effects, kdenlive), display software (Samsung, LG TV, monitor, mobile display), video sensor software (high-speed, mobile camera sensor), and video recognition improvement algorithms (video recognition applications).

Although the embodiments have been described above with reference to a limited set of drawings, those skilled in the art will understand that the present invention can be modified and changed in various ways without departing from the spirit and scope of the invention set forth in the claims below. Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

40, 60: first-stage warping unit
40-1, 60-1: first warping unit
40-2, 60-2: second warping unit
45, 65: U-Net module (convolutional neural network)
50, 70: second-stage warping unit
55, 75: ResNet architecture module
100: DIFRINT framework

Claims

1. A video stabilization method performed by executing a computer program on a computer device, the method comprising:
synthesizing an intermediate frame of adjacent first and second original frames by interpolating the first and second original frames; and
stabilizing video frames by sequentially performing the synthesizing of the intermediate frame of the first and second original frames on successive original video frames,
wherein the synthesizing of the intermediate frame (f_int) comprises: estimating a bidirectional optical flow between the first and second original frames; generating first and second warped frames by warping the first and second original frames halfway toward each other based on the estimated optical flow; and estimating and generating the intermediate frame of the first and second original frames by learning from the first and second warped frames through a convolutional neural network, and
wherein the estimation of the intermediate frame is performed such that fine details are preserved, by learning how to combine the first and second warped frames at different scales and predicting local high-resolution image information with global low-resolution image information.

2. (Deleted)

3. The video stabilization method of claim 1, wherein the bidirectional optical flow is estimated using a PWC-Net module, which is a convolutional neural network (CNN) model for estimating optical flow.

4. The video stabilization method of claim 1, wherein the intermediate frame is estimated using a U-Net module, which is a convolutional neural network.

5. (Deleted)

6. The video stabilization method of claim 1, further comprising generating a final intermediate frame by using, at every iteration of frame interpolation, a third original frame corresponding to the intermediate frame to correct fine detail errors contained in the intermediate frame, so as to prevent the accumulation of image blur.

7. The method of claim 6, wherein the generating of the final intermediate frame comprises: warping the third original frame between the first and second original frames toward the intermediate frame by using the intermediate frame as a reference frame; supplying the warped third original frame together with the intermediate frame to a ResNet module; and generating, in the ResNet module, a final interpolated (intermediate) frame by correcting minute detail errors included in the intermediate frame using the warped third original frame.

8. The video stabilization method of claim 6, further comprising enhancing the visual stability of the video frames through deep iterative frame interpolation, by repeating the synthesizing of the intermediate frame and the sequentially performing at least once on the consecutive video frames generated through the generating of the final intermediate frame.

9. The video stabilization method of claim 6, further comprising adjusting frame stability by introducing a skip parameter that adjusts which frames are used for interpolation and setting the skip parameter to a desired value during the frame interpolation.

10. The video stabilization method of claim 6, further comprising training a frame interpolation model by defining a pseudo-ground-truth frame, which is a spatially transformed version of the third original frame, warping the first and second original frames preceding and following the third original frame toward the pseudo-ground-truth frame, and reconstructing the pseudo-ground-truth frame (f_s).

11. The video stabilization method of claim 10, wherein the spatial transformation of the third original frame can be performed in an arbitrary direction and with an arbitrary scale, wherein the arbitrary scale is a value within 1/8 of a frame width.

12. A video stabilization method performed by executing a computer program on a computer device, the method comprising:
synthesizing an intermediate frame of adjacent first and second original frames by interpolating the first and second original frames;
sequentially performing the synthesizing of the intermediate frame of the first and second original frames on successive original video frames;
generating a final intermediate frame by using, every time frame interpolation is repeated, a third original frame corresponding to the intermediate frame to correct fine detail errors contained in the intermediate frame, so as to prevent the accumulation of image blur; and
enhancing the visual stability of the video frames through deep iterative frame interpolation, by repeating the synthesizing of the intermediate frame and the sequentially performing at least once on the consecutive video frames generated through the generating of the final intermediate frame.

13. The method of claim 12, wherein the synthesizing of the intermediate frame (f_int) comprises: estimating a bidirectional optical flow between the first and second original frames; generating first and second warped frames by warping the first and second original frames halfway toward each other based on the estimated optical flow; and estimating and generating the intermediate frame of the first and second original frames by learning from the first and second warped frames through a convolutional neural network.

14. The method of claim 12, wherein the estimation of the intermediate frame is performed such that fine details are preserved, by learning how to combine the first and second warped frames at different scales and predicting local high-resolution image information with global low-resolution image information.

15. The method of claim 12, wherein the generating of the final intermediate frame comprises: warping the third original frame between the first and second original frames toward the intermediate frame by using the intermediate frame as a reference frame; supplying the warped third original frame together with the intermediate frame to a ResNet module; and generating, in the ResNet module, a final interpolated (intermediate) frame by correcting fine detail errors contained in the intermediate frame using the warped third original frame.

16. The method of claim 12, further comprising adjusting frame stability by introducing a skip parameter that adjusts which frames are used for interpolation and setting the skip parameter to a desired value during the frame interpolation.

17. The method of claim 12, further comprising training a frame interpolation model by defining a pseudo-ground-truth frame, which is a spatially transformed version of the third original frame, warping the first and second original frames preceding and following the third original frame toward the pseudo-ground-truth frame, and reconstructing the pseudo-ground-truth frame (f_s).

18. A computer-executable program stored on a computer-readable recording medium to perform the video stabilization method according to any one of claims 1, 3, 4, and 6 to 17.

19. A computer-readable recording medium on which a computer program for performing the video stabilization method according to any one of claims 1, 3, 4, and 6 to 17 is recorded.