KR20190091806A

KR20190091806A - Video sequences generating system using generative adversarial networks and the method thereof

Info

Publication number: KR20190091806A
Application number: KR1020180010817A
Authority: KR
Inventors: 노용만; 위삼 자랄 알하즈 바다르; 구건모; 이상민
Original assignee: 한국과학기술원
Priority date: 2018-01-29
Filing date: 2018-01-29
Publication date: 2019-08-07
Also published as: KR102095097B1

Abstract

The present invention relates to a video sequence generating system using a dynamics transfer generative adversarial network (GAN) that generates video sequences based on generative adversarial learning, and to a method thereof. The system generates variable-length video sequences using the dynamics of source video data and the appearance (spatial structure) of target image data, and can judge the spatial and temporal consistency of the video sequences using a discriminator network that includes two discrimination units.

Description

VIDEO SEQUENCES GENERATING SYSTEM USING GENERATIVE ADVERSARIAL NETWORKS AND THE METHOD THEREOF

The present invention relates to a video sequence generation system using a generative adversarial network and a method thereof, and more particularly, to a system and method using a Dynamics Transfer generative adversarial network (Dynamics Transfer GAN) that generates video sequences based on generative adversarial learning.

Recent advances in generative models have influenced research on image synthesis. Generative models, in particular generative adversarial networks (GANs), generate images from random distributions, synthesize images by nonlinearly transforming a priming image into a synthesized image, or transfer a synthesized image from the source image domain to another domain.

Accordingly, research is under way to extend generative models to the generation of video sequences. To generate a video sequence, however, a generative model must understand the spatial structure (appearance) of the scene as well as the dynamics that drive the scene motion. The generative model must also be able to reconstruct temporal variation over variable sequence lengths.

In most cases, however, the dynamics are non-rigid or can deform the shape of the spatial structure, so these approaches are limited in that the effectiveness of generative models for video generation can be impaired.

To overcome this limitation, some studies have investigated combinations of 3D convolutions, recurrent neural networks (RNNs), and convolutional long short-term memory (LSTM) to predict future frames of a video sequence. However, predicting a few future frames is regarded as a conditional image generation problem, which differs from video generation.

C. Vondrick et al. proposed an extension of the GAN that generates video using scene dynamics. Their generator consists of two streams that model the scene as a combination of foreground and background, and 3D convolutions serve as a spatiotemporal discriminator that examines the generated sequence.

M. Saito et al. proposed a similar approach with a two-stream generator and a single spatiotemporal discriminator, and S. Tulyakov et al. proposed generating variable-length sequences by separating the sampling procedure for the input distribution into samples from a content subspace and a motion subspace.

However, the approaches of C. Vondrick et al. and M. Saito et al. cannot model variable-length video sequences and cannot generate long sequences.

Furthermore, although C. Vondrick et al., M. Saito et al., and S. Tulyakov et al. proposed architectures that extend the GAN to generate video, their spatiotemporal discriminators are implemented with fixed-size 3D convolutions, so the spatiotemporal consistency of the generated video can be verified only over a small, fixed sequence length. In addition, because the dynamics are coupled with the spatial structure during spatiotemporal encoding, the dynamics cannot be applied to a designated spatial structure (appearance).

C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pages 613-621, 2016.
M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In IEEE International Conference on Computer Vision (ICCV), pages 2830-2839, 2017.
S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993, 2017.

An object of the present invention is to provide a technique capable of generating a variable-length video sequence using the dynamics of source video data and the appearance (spatial structure) of target image data.

Another object of the present invention is to provide a technique that preserves the spatial structure of the target image data by suppressing the appearance of the source video data, so that only the dynamics of the source video data are extracted before being imposed on the target image data and are then applied to the appearance of the target image data.

Another object of the present invention is to provide a technique capable of judging the spatial and temporal consistency of a video sequence using a discriminator network that includes two discrimination units.

Another object of the present invention is to provide a technique in which a generator network and a discriminator network are trained adversarially in alternation, so that the generator network can ultimately generate a final video sequence of improved quality.

A video sequence generation system using a generative adversarial network according to an embodiment of the present invention includes: a dynamic channel embedding unit that inserts into a specific channel an appearance-suppressed dynamics feature of source video data, obtained from the source video data through a dynamics feature encoder that encodes temporal dynamics; a generation unit that combines the appearance of target image data with the embedded specific channel to generate a variable-length video sequence; and a discrimination unit that discriminates frame accuracy between the source video data and the video sequence and discriminates dynamics consistency between the appearance-suppressed dynamics feature of the source video data and the appearance-suppressed dynamics feature of the video sequence.

The dynamic channel embedding unit may insert the appearance-suppressed dynamics feature of the source video data into the specific channel, thereby combining the appearance and the appearance-suppressed dynamics feature of the source video data at time T.

The dynamic channel embedding unit may use dropout to generate noise in the appearance-suppressed dynamics feature of the source video data.

The generation unit may preserve the appearance of the target image data by using a U-Net network structure.

The generation unit may generate a final video sequence based on the spatial and temporal consistency of the video sequence as judged by the discrimination unit.

The discrimination unit may include a spatial discriminator that discriminates frame accuracy between the source video data and the video sequence, and a dynamics discriminator that discriminates dynamics consistency between the appearance-suppressed dynamics feature of the source video data and the appearance-suppressed dynamics feature of the video sequence.
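One way to read this two-part discrimination unit is as a single objective with a spatial term and a dynamics term. The sketch below is illustrative only: the binary cross-entropy loss, the equal weighting of the two terms, and the names `bce` and `discriminator_loss` are assumptions, not taken from the specification.

```python
import math


def bce(probability: float, label: int) -> float:
    """Binary cross-entropy for one prediction (assumed loss, not from the patent)."""
    eps = 1e-12
    p = min(max(probability, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def discriminator_loss(
    spatial_real: float,
    spatial_fake: float,
    dynamics_real: float,
    dynamics_fake: float,
) -> float:
    """Sum of a spatial (per-frame) term and a dynamics (per-sequence) term.

    Each argument is a hypothetical discriminator output in (0, 1), read as
    "probability that the input is real".
    """
    spatial_term = bce(spatial_real, 1) + bce(spatial_fake, 0)
    dynamics_term = bce(dynamics_real, 1) + bce(dynamics_fake, 0)
    return spatial_term + dynamics_term
```

Under these assumptions, a discriminator that scores real inputs near 1 and generated inputs near 0 on both terms obtains a lower loss than one that is unsure on either term.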

The spatial discriminator may compare frames of the source video data with frames of the video sequence to determine whether each frame of the video sequence is a real frame or a generated (fake) frame.

The dynamics discriminator may compare the appearance-suppressed dynamics feature of the video sequence, obtained through the dynamics feature encoder, with the appearance-suppressed dynamics feature of the source video data to determine whether the former represents real dynamics or generated (fake) dynamics.

So as not to be affected by the variable length of the video sequence, the dynamics discriminator may discriminate dynamics consistency using only the appearance-suppressed dynamics feature of the video sequence up to time T.

The dynamics feature encoder may generate a static sequence by replicating the first frame of the source video data or of the video sequence; obtain, using a recurrent neural network (RNN), the latent spatiotemporal features of the source video data or the video sequence and the latent spatiotemporal features of the static sequence; and remove the latent spatiotemporal features of the static sequence from those of the source video data or the video sequence to provide the appearance-suppressed dynamics feature of the source video data or of the video sequence.

A video sequence generation system using a generative adversarial network according to another embodiment of the present invention includes: a generator network that generates a variable-length video sequence using the dynamics of source video data and the appearance of target image data, and generates a final video sequence based on a discrimination result; and a discriminator network that discriminates frame accuracy between the source video data and the video sequence and discriminates dynamics consistency between the appearance-suppressed dynamics feature of the source video data and the appearance-suppressed dynamics feature of the video sequence, wherein the generator network and the discriminator network are operated alternately.
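The alternating operation of the two networks can be pictured as a loop that updates one network and then the other. The following is a minimal sketch under stated assumptions: the function name `train_alternately`, the callback interface, and the choice to update the discriminator first in each round are all hypothetical and are not specified in the text.

```python
from typing import Callable, List, Tuple


def train_alternately(
    update_discriminator: Callable[[], float],
    update_generator: Callable[[], float],
    num_rounds: int,
) -> List[Tuple[str, float]]:
    """Run num_rounds rounds, each with one discriminator and one generator update.

    Each callback is assumed to perform one optimization step on its own
    network (holding the other fixed) and to return the resulting loss.
    """
    history: List[Tuple[str, float]] = []
    for _ in range(num_rounds):
        history.append(("D", update_discriminator()))  # discriminator step
        history.append(("G", update_generator()))      # generator step
    return history
```

In practice the callbacks would wrap the actual optimizer steps; here they are left abstract so that only the alternation itself is shown.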

The generator network may include a dynamic channel embedding unit that inserts into a specific channel the appearance-suppressed dynamics feature of the source video data, obtained from the source video data through a dynamics feature encoder that encodes temporal dynamics, and a generation unit that combines the appearance of the target image data with the embedded specific channel to generate the variable-length video sequence.

The discriminator network may include a spatial discriminator that discriminates frame accuracy between the source video data and the video sequence, and a dynamics discriminator that discriminates dynamics consistency between the appearance-suppressed dynamics feature of the source video data and the appearance-suppressed dynamics feature of the video sequence.

A method of operating a video sequence generation system using a generative adversarial network according to an embodiment of the present invention includes: inserting into a specific channel an appearance-suppressed dynamics feature of source video data, obtained through a dynamics feature encoder that encodes temporal dynamics; combining the appearance of target image data with the embedded specific channel to generate a variable-length video sequence; discriminating frame accuracy between the source video data and the video sequence, and discriminating dynamics consistency between the appearance-suppressed dynamics feature of the source video data and the appearance-suppressed dynamics feature of the video sequence; and outputting, according to the discrimination result, a final video sequence based on the spatial and temporal consistency of the video sequence.

The inserting into the specific channel may insert the appearance-suppressed dynamics feature of the source video data into the specific channel, thereby combining the appearance and the appearance-suppressed dynamics feature of the source video data at time T.

The generating of the variable-length video sequence may preserve the appearance of the target image data by using a U-Net network structure.

The discriminating of frame accuracy and dynamics consistency may include comparing frames of the source video data with frames of the video sequence to determine whether each frame of the video sequence is a real frame or a generated (fake) frame.

The discriminating of frame accuracy and dynamics consistency may include comparing the appearance-suppressed dynamics feature of the video sequence, obtained through the dynamics feature encoder, with the appearance-suppressed dynamics feature of the source video data to determine whether the former represents real dynamics or generated (fake) dynamics.

According to an embodiment of the present invention, a variable-length video sequence can be generated using the dynamics of source video data and the appearance (spatial structure) of target image data.

In addition, according to an embodiment of the present invention, the spatial structure of the target image data can be preserved by suppressing the appearance of the source video data, so that only the dynamics of the source video data are extracted before being imposed on the target image data and are then applied to the appearance of the target image data.

In addition, according to an embodiment of the present invention, the spatial and temporal consistency of a video sequence can be judged using a discriminator network that includes two discrimination units.

In addition, according to an embodiment of the present invention, the generator network and the discriminator network are trained adversarially in alternation, so that the generator network can ultimately generate a final video sequence of improved quality.

FIG. 1 illustrates the overall process of a video sequence generation system using a generative adversarial network according to an embodiment of the present invention.
FIG. 2 illustrates the detailed configuration of a video sequence generation system using a generative adversarial network according to an embodiment of the present invention.
FIG. 3 illustrates the operation of a dynamics feature encoder according to an embodiment of the present invention.
FIG. 4 is a flowchart of a video sequence generation method using a generative adversarial network according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. However, the present invention is not limited to these embodiments. Like reference numerals in the drawings denote like elements.

The terminology used herein is chosen to appropriately describe preferred embodiments of the present invention and may vary depending on the intention of a viewer or operator, or on the conventions of the field to which the present invention belongs. Accordingly, these terms should be defined based on the content of this specification as a whole.

The present invention relates to a video sequence generation system using a generative adversarial network, and a method thereof, that generate a video sequence by applying the dynamics of source video data to target image data.

The main contributions of the proposed method are as follows.

1. The spatial structure (appearance) of the target image data is maintained while the dynamics are obtained from the source video data. To this end, a new appearance suppression function is proposed that suppresses the spatial appearance of the source video data while retaining its temporal dynamics.

2. A variable-length video sequence with no restriction on sequence length is generated by using a spatial discriminator that judges the frame accuracy of the video sequence and a dynamics discriminator that judges the integrity of the video sequence.

3. A visualization of the dynamics of the source video data is provided.

4. At final test time, the source video data and target image data input to the generator network are used to generate a video sequence by applying the dynamics of the source video data to the target image data.

FIG. 1 illustrates the overall process of a video sequence generation system using a generative adversarial network according to an embodiment of the present invention.

Referring to FIG. 1, the video sequence generation system according to an embodiment of the present invention applies target image data (target image) and source video data (source video sequence data) to a Dynamics Transfer GAN to generate a variable-length video sequence (generated video sequence).

As shown in FIG. 1, the system suppresses the appearance of the source video data and applies the dynamics of the source video data (source video sequence dynamics) to the appearance of the target image data (target image appearance) to generate the final video sequence (generated video sequence).

As a result, the present invention preserves the appearance (spatial structure) of the target image data while generating a spatially and temporally consistent video sequence.

FIG. 2 illustrates the detailed configuration of a video sequence generation system using a generative adversarial network according to an embodiment of the present invention.

Referring to FIG. 2, the video sequence generation system according to an embodiment of the present invention generates a video sequence using the dynamics of source video data and the appearance (spatial structure) of target image data.

To this end, the video sequence generation system 200 according to an embodiment of the present invention includes a dynamic channel embedding unit 220, a generation unit 230, and a discrimination unit 240.

The dynamic channel embedding unit 220 inserts into a specific channel the appearance-suppressed dynamics feature of the source video data 201, obtained from the source video data 201 through the first dynamics feature encoder 211, which encodes temporal dynamics.

For example, the source video data 201 may be video of the facial gestures of one or more users or, more generally, video data of an environment, an object, a person, or the like captured over time. Likewise, the target image data 202 may be an image of the facial gesture of one or more users, or image data of an environment, an object, a person, or the like.

Referring to FIG. 2, the first dynamics feature encoder 211 may use a recurrent neural network (RNN) to provide an appearance-suppressed dynamics feature for the input source video data 201. Here, the first dynamics feature encoder 211 may extract the appearance-suppressed dynamics feature of the source video data 201 by removing the spatial encoding effect of the source video data 201 from a pre-trained RNN.

The dynamics feature encoder according to an embodiment of the present invention is described in more detail below with reference to FIG. 3.

FIG. 3 illustrates the operation of a dynamics feature encoder according to an embodiment of the present invention.

Here, the dynamics feature encoder 210 represents both the first dynamics feature encoder 211, which provides the appearance-suppressed dynamics feature of the source video data, and the second dynamics feature encoder 212, which provides the appearance-suppressed dynamics feature of the video sequence, as shown in FIG. 2; the two perform the same operation.

Referring to FIG. 3, the dynamics feature encoder 210 may generate a static sequence (a sequence of replicas of the first frame) from the source video data (source video sequence data) by replicating the first frame of the source video data. The source video data and the static sequence may then be fed to a pre-trained recurrent neural network (RNN) 213 to generate the latent spatiotemporal features of the source video data (source video spatiotemporal features) and the latent spatiotemporal features of the static sequence (static sequence spatiotemporal features), respectively.

Here, the source video data is a concatenation of a plurality of frames (or images), and the static sequence may be generated by replicating the first frame of the source video data.

As an example, the RNN may be a pre-trained CNN-LSTM 213, which uses a convolutional neural network (CNN) and long short-term memory (LSTM).

Because the static sequence consists of replicas of a single frame of the source video data, the CNN-LSTM 213 encodes in the latent spatiotemporal features of the static sequence only the spatial appearance, not temporal characteristics.

Accordingly, the dynamics feature encoder 210 subtracts the latent spatiotemporal features of the static sequence from the latent spatiotemporal features of the source video data, thereby suppressing the spatial appearance of the source video data 201 and extracting only its dynamics. That is, the dynamics feature encoder 210 may calculate the appearance-suppressed dynamics feature of the source video data through [Equation 1] below.

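[Equation 1]

(appearance-suppressed dynamics feature of the source video data) = (latent spatiotemporal features of the source video data) - (latent spatiotemporal features of the static sequence)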
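The subtraction step can be illustrated with a small sketch. Everything in it is an assumption made for illustration: `toy_recurrent_encoder` is a hypothetical stand-in for the pre-trained CNN-LSTM 213 (a simple running state, not a trained network), and frames are represented as flat lists of floats.

```python
from typing import Callable, List, Sequence

Frame = List[float]    # hypothetical: one frame flattened to a list of floats
Feature = List[float]  # hypothetical: one latent feature vector per time step


def toy_recurrent_encoder(frames: Sequence[Frame]) -> List[Feature]:
    """Stand-in for the pre-trained CNN-LSTM (an assumption, not the real model).

    Keeps a running state h_t = 0.5 * h_(t-1) + x_t, so each output depends
    on the current frame and on the frames before it.
    """
    state = [0.0] * len(frames[0])
    outputs: List[Feature] = []
    for frame in frames:
        state = [0.5 * h + x for h, x in zip(state, frame)]
        outputs.append(list(state))
    return outputs


def appearance_suppressed_dynamics(
    frames: Sequence[Frame],
    encoder: Callable[[Sequence[Frame]], List[Feature]] = toy_recurrent_encoder,
) -> List[Feature]:
    """Subtract the static-sequence features from the video features."""
    if not frames:
        raise ValueError("at least one frame is required")
    # Static sequence: the first frame replicated to the length of the video.
    static_sequence = [list(frames[0]) for _ in frames]
    video_features = encoder(frames)
    static_features = encoder(static_sequence)
    return [
        [v - s for v, s in zip(video_step, static_step)]
        for video_step, static_step in zip(video_features, static_features)
    ]
```

With this toy encoder, a video whose frames are all identical yields an all-zero result, which matches the intent that only motion, and not appearance, survives the subtraction.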

Although FIG. 3 illustrates, as an example, the operation of obtaining the appearance-suppressed dynamics feature for the source video data from the source video data, as in the first dynamic characteristic encoder 211 shown in FIG. 2, the appearance-suppressed dynamics feature for the video sequence may likewise be obtained from the video sequence, as in the second dynamic characteristic encoder 212 shown in FIG. 2, and the process is the same.
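The appearance suppression described for both encoders can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the filing: `make_toy_encoder` is a hypothetical stand-in for the pre-trained CNN-LSTM 213 (a small tanh recurrent cell with random weights), frames are flat lists of floats, and all function names are invented for the example.

```python
import math
import random


def make_toy_encoder(frame_dim, hidden_dim, seed=0):
    """Return a tiny tanh recurrent encoder (hypothetical stand-in for CNN-LSTM 213)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(frame_dim)]
            for _ in range(hidden_dim)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
             for _ in range(hidden_dim)]

    def encode(frames):
        """Map a list of frames to one hidden state per time step."""
        h = [0.0] * hidden_dim
        states = []
        for x in frames:
            h = [
                math.tanh(
                    sum(w * xi for w, xi in zip(w_in[j], x))
                    + sum(w * hi for w, hi in zip(w_rec[j], h))
                )
                for j in range(hidden_dim)
            ]
            states.append(h)
        return states

    return encode


def appearance_suppressed_dynamics(frames, encode):
    """Subtract the static-sequence features from the video features, step by step."""
    static_sequence = [frames[0]] * len(frames)  # replicate the first frame
    video_feats = encode(frames)
    static_feats = encode(static_sequence)
    return [
        [hv - hs for hv, hs in zip(v, s)]
        for v, s in zip(video_feats, static_feats)
    ]
```

The same function applies unchanged to a generated video sequence, which mirrors the statement that the first and second encoders follow the same process. A video whose frames never change yields an all-zero result, since it is identical to its own static sequence.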

Referring again to FIG. 2, the dynamic channel embedding unit 220 may embed the appearance-suppressed dynamics feature for the source video data 201 into a specific channel, thereby combining the appearance and the appearance-suppressed dynamics feature for the source video data 201 at time T.

Here, instead of adding random noise to the target image data 202, the dynamic channel embedding unit 220 may introduce noise into the appearance-suppressed dynamics feature for the source video data 201 by using dropout applied to a plurality of frames (or layers).
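One possible reading of the channel embedding and the dropout-based noise is sketched below. The text does not specify how the feature is placed into the channel, so the approach here is an assumption: each entry of the dynamics feature is broadcast into a constant-valued extra channel appended to the image. The function names and the inverted-dropout scaling are likewise illustrative choices.

```python
import random


def dropout(vector, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


def embed_dynamics_channels(image, dynamics):
    """Append one constant-valued channel per dynamics entry (assumed scheme).

    `image` is a list of channels, each a list of rows (C x H x W).
    """
    height, width = len(image[0]), len(image[0][0])
    extra = [[[d] * width for _ in range(height)] for d in dynamics]
    return image + extra
```

With this layout a convolutional generator sees the dynamics at every spatial location, and the dropout mask supplies the stochasticity that an explicit noise input would otherwise provide.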

The generation unit 230 combines the appearance of the target image data 202 with the embedded specific channel to generate a variable-length video sequence 203.

Here, the generation unit 230 may have a U-Net structure. By using the U-Net structure, the generation unit 230 can preserve the appearance of the target image data 202, and may also preserve details of the generator network that includes the dynamic channel embedding unit 220 and the generation unit 230.

The generation unit 230 may combine the appearance of the target image data 202 with the embedded specific channel to generate the variable-length video sequence 203 (generated video sequence). In this way, the generation unit 230 can turn the source video data 201 into the variable-length video sequence 203.

Furthermore, the generation unit 230 may generate a final video sequence based on the spatial and temporal consistency of the video sequence 203 as judged by the discrimination unit 240. More specifically, the generation unit 230 generates the video sequence 203 by combining the appearance of the target image data 202 with the dynamics of the source video data 201 (for example, the embedded specific channel), and the generated video sequence 203 may then be judged by the discrimination unit 240. Based on that result, the generation unit 230 may repeatedly refine the video sequence 203 so that the discrimination unit 240 mistakes it for real video. By repeating this process with the discrimination unit 240, the generation unit 230 can ultimately generate a final video sequence of improved quality.

The discrimination unit 240 discriminates frame accuracy between the source video data 201 and the video sequence 203, and discriminates dynamics consistency between the appearance-suppressed dynamics feature for the source video data 201 and the appearance-suppressed dynamics feature for the video sequence 203. The discrimination unit 240 may be trained to tell the source video data 201 apart from the video sequence 203.

More specifically, the discrimination unit 240 may include a spatial discrimination unit 241, which discriminates frame accuracy (real/fake frame) between the source video data 201 and the video sequence 203, and a dynamic discrimination unit 242, which discriminates dynamics consistency (real/fake sequence) between the appearance-suppressed dynamics feature for the source video data 201 and the appearance-suppressed dynamics feature for the video sequence 203.

The spatial discrimination unit 241 checks the accuracy of the frames of the video sequence 203. It compares frames of the source video data 201 with frames of the video sequence 203 to discriminate whether a frame of the video sequence 203 is a real frame or a generated (fake) frame.

For example, the spatial discrimination unit 241 may be composed of a stack of convolutional networks and an output layer. It divides the source video data 201 and the video sequence 203 into a plurality of frames at time T, judges the accuracy of each frame, and discriminates whether the frame is a real frame (a frame of the source video data 201) or a generated frame (a frame of the video sequence 203).

The dynamic discrimination unit 242 compares the appearance-suppressed dynamics feature for the video sequence 203, obtained through the second dynamic characteristic encoder 212, with the appearance-suppressed dynamics feature for the source video data 201, obtained through the first dynamic characteristic encoder 211, to discriminate whether the dynamics of the video sequence are real dynamics or generated (fake) dynamics.

For example, the second dynamic characteristic encoder 212 may use a recurrent neural network (RNN) to provide the appearance-suppressed dynamics feature (generated appearance suppressed dynamics feature) for the input video sequence 203. Here, the second dynamic characteristic encoder 212 may extract the appearance-suppressed dynamics feature for the video sequence 203 by removing the effect of the spatial encoding of the video sequence 203 from the pre-trained RNN.

As in the details described with reference to FIG. 3, the second dynamic characteristic encoder 212 uses the CNN-LSTM 213 to obtain the latent spatiotemporal feature of the static sequence from the static sequence, and subtracts the latent spatiotemporal feature of the static sequence from the latent spatiotemporal feature of the video sequence, thereby suppressing the spatial appearance of the video sequence 203 and extracting only its dynamics. That is, the second dynamic characteristic encoder 212 may compute the appearance-suppressed dynamics feature for the video sequence (generated appearance suppressed dynamics feature) through [Equation 2] below.

[Equation 2]

As a result, by comparing the appearance-suppressed dynamics feature for the video sequence 203, obtained through the second dynamic characteristic encoder 212, with the appearance-suppressed dynamics feature for the source video data 201, obtained through the first dynamic characteristic encoder 211, the dynamic discrimination unit 242 can discriminate whether the appearance-suppressed dynamics feature for the video sequence 203 represents real dynamics (the dynamics of the source video data 201) or generated dynamics (the dynamics of the video sequence 203).

Referring to FIG. 2, the spatial discrimination unit 241 examines the video sequence 203 as a series of frames, whereas the dynamic discrimination unit 242 examines the video sequence 203 as a single sample in the input space. Specifically, the video sequence 203 may have a variable length. For the dynamic discrimination unit 242 to treat the entire sequence as one sample point regardless of the length of the video sequence 203, the input size of the dynamic discrimination unit 242 must not depend on the sequence length. Accordingly, the dynamic discrimination unit 242 may discriminate dynamics consistency using only the dynamics feature of the video sequence 203 at time T. Here, the dynamics feature of the video sequence 203 at time T represents the dynamics of the entire sequence, that is, the dynamics from the beginning up to time T.
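The length independence described above can be illustrated with a short sketch. It assumes the dynamics are available as one feature vector per time step and that the feature at the last step T summarizes the whole sequence, as the text states. The logistic scorer is a hypothetical placeholder for the dynamic discrimination unit 242, not its actual architecture.

```python
import math


def dynamic_discriminator_input(dynamics_per_step):
    """Use only the feature at the final time step T as the sequence-level sample."""
    return dynamics_per_step[-1]


def toy_dynamic_discriminator(feature, weights, bias=0.0):
    """Logistic score in (0, 1); a placeholder for the dynamic discrimination unit."""
    z = sum(w * f for w, f in zip(weights, feature)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

Because the recurrent state at step T already accumulates everything from the first frame onward, sequences of 3 frames and of 300 frames both reach the discriminator as a vector of the same size.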

As a result, the video sequence generation system 200 using a generative adversarial network according to an embodiment of the present invention generates a video sequence 203 that has the realistic spatial structure (appearance) of the target image data 202 and temporal dynamics that mimic the source video data 201.

The video sequence generation system 200 using a generative adversarial network according to another embodiment of the present invention includes a generator network (not shown), which includes the dynamic channel embedding unit 220 and the generation unit 230 shown in FIG. 2, and a discriminator network (not shown), which includes the spatial discrimination unit 241 and the dynamic discrimination unit 242.

The generator network generates the variable-length video sequence 203 using the dynamics of the source video data 201 and the appearance of the target image data 202, and generates a final video sequence based on the discrimination result of the discriminator network.

Here, the generator network may include the dynamic channel embedding unit 220, which embeds into a specific channel the appearance-suppressed dynamics feature for the source video data 201, in which the temporal dynamics of the source video data 201 are encoded by the first dynamic characteristic encoder 211, and the generation unit 230, which combines the appearance of the target image data 202 with the embedded specific channel to generate the variable-length video sequence 203.

The discriminator network discriminates frame accuracy between the source video data 201 and the video sequence 203, and discriminates dynamics consistency between the appearance-suppressed dynamics feature for the source video data 201 and the appearance-suppressed dynamics feature for the video sequence 203.

Here, the discriminator network may include the spatial discrimination unit 241, which discriminates frame accuracy between the source video data 201 and the video sequence 203, and the dynamic discrimination unit 242, which discriminates dynamics consistency between the appearance-suppressed dynamics feature for the source video data 201 and the appearance-suppressed dynamics feature for the video sequence 203.

The generator network and the discriminator network of the video sequence generation system 200 using a generative adversarial network according to another embodiment of the present invention may be trained alternately. For example, the discriminator network may be trained by maximizing the loss term, the generator network may be updated after the discriminator network, and the generative part may be trained by minimizing the adversarial loss.
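The two sides of this alternating scheme can be sketched with the standard GAN objective. The text does not give the exact loss, so the logarithmic form below is an assumption, and the function names are illustrative. Scores are discriminator outputs strictly between 0 and 1, where values near 1 mean "judged real".

```python
import math


def discriminator_objective(real_scores, fake_scores):
    """Mean of log D(real) + log(1 - D(fake)); the discriminator ascends this value."""
    real_term = sum(math.log(s) for s in real_scores) / len(real_scores)
    fake_term = sum(math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_adversarial_loss(fake_scores):
    """Mean of log(1 - D(fake)); the generator descends this value."""
    return sum(math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
```

In one training iteration under this sketch, the discriminator parameters would first be stepped to increase `discriminator_objective`, and the generator parameters would then be stepped to decrease `generator_adversarial_loss`. With two discriminators, one natural choice is to sum the spatial and dynamic terms, though the text does not state how they are weighted.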

The video sequence generation system 200 using a generative adversarial network according to another embodiment of the present invention may improve the quality of the video sequence by adding a reconstruction term to the generator network, and may also apply a reconstruction term to the appearance-suppressed dynamics feature for the video sequence so that the dynamics remain consistent regardless of the sequence length.
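A reconstruction term on the dynamics feature could look like the sketch below. The distance measure and the weighting are not given in the text, so the mean absolute (L1) distance and the `weight` parameter are assumptions, and only the dynamics term is shown.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def generator_total_loss(adv_loss, generated_dynamics, source_dynamics, weight=10.0):
    """Adversarial loss plus a weighted dynamics reconstruction term (assumed form)."""
    return adv_loss + weight * l1_distance(generated_dynamics, source_dynamics)
```

Since both dynamics vectors have a fixed size whatever the number of frames, this term can be evaluated identically for short and long sequences.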

FIG. 4 is a flowchart of a video sequence generation method using a generative adversarial network according to an embodiment of the present invention.

Each step of FIG. 4 is performed by the video sequence generation system using a generative adversarial network according to the embodiment of the present invention shown in FIG. 2.

Referring to FIG. 4, in step 410, the dynamic channel embedding unit embeds into a specific channel the appearance-suppressed dynamics feature for the source video data, in which the temporal dynamics are encoded by the dynamic characteristic encoder.

Step 410 may be a step of embedding the appearance-suppressed dynamics feature for the source video data into the specific channel to combine the appearance and the appearance-suppressed dynamics feature for the source video data at time T.

Here, in step 410, instead of adding random noise to the target image data, the dynamic channel embedding unit may introduce noise into the appearance-suppressed dynamics feature for the source video data by using dropout applied to a plurality of frames (or layers).

In step 420, the generation unit combines the appearance of the target image data with the embedded specific channel to generate a variable-length video sequence.

Here, the generation unit may have a U-Net structure, and step 420 may be a step of preserving the appearance of the target image data by using the U-Net structure.

In step 430, the discrimination unit discriminates frame accuracy between the source video data and the video sequence, and discriminates dynamics consistency between the appearance-suppressed dynamics feature for the source video data and the appearance-suppressed dynamics feature for the video sequence.

Step 430 may be a step of comparing frames of the source video data with frames of the video sequence to discriminate whether a frame of the video sequence is a real frame or a generated (fake) frame.

For example, the step of discriminating the frames may use a structure composed of a stack of convolutional networks and an output layer to divide the source video data and the video sequence into a plurality of frames at time T, judge the accuracy of each frame, and discriminate whether the frame is a real frame (a frame of the source video data) or a generated frame (a frame of the video sequence).

Step 430 may also be a step of comparing the appearance-suppressed dynamics feature for the video sequence, obtained through the dynamic characteristic encoder, with the appearance-suppressed dynamics feature for the source video data to discriminate whether the appearance-suppressed dynamics feature for the video sequence represents real dynamics or generated (fake) dynamics.

For example, the step of discriminating the dynamics may compare the appearance-suppressed dynamics feature for the video sequence, obtained through the dynamic characteristic encoder, with the appearance-suppressed dynamics feature for the source video data, and thereby discriminate whether the appearance-suppressed dynamics feature for the video sequence represents real dynamics (the dynamics of the source video data) or generated dynamics (the dynamics of the video sequence).

In step 440, the generation unit outputs the final video sequence based on the discrimination result, that is, on the spatial and temporal consistency of the video sequence. For example, based on the discrimination result produced by the discrimination unit in step 430, the generation unit may in step 440 generate a final video sequence of improved quality.

The video sequence generation method using a generative adversarial network according to an embodiment of the present invention is characterized in that steps 420 and 430 are learned repeatedly, so that step 440 ultimately produces a final video sequence of improved quality.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiment, and vice versa.

Although the embodiments have been described above with reference to limited embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims that follow.

200: video sequence generation system using a generative adversarial network
201: source video data
202: target image data
203: video sequence
210, 211, 212: dynamic characteristic encoders
213: trained CNN-LSTM (or recurrent neural network (RNN))
220: dynamic channel embedding unit
230: generation unit
241: spatial discrimination unit
242: dynamic discrimination unit

Claims

A video sequence generation system using a generative adversarial network, comprising:
a dynamic channel embedding unit configured to embed, into a specific channel, an appearance-suppressed dynamics feature for source video data, in which temporal dynamics of the source video data are encoded by a dynamic characteristic encoder;
a generation unit configured to combine an appearance of target image data with the embedded specific channel to generate a variable-length video sequence; and
a discrimination unit configured to discriminate frame accuracy between the source video data and the video sequence, and to discriminate dynamics consistency between the appearance-suppressed dynamics feature for the source video data and an appearance-suppressed dynamics feature for the video sequence.

The system of claim 1, wherein the dynamic channel embedding unit embeds the appearance-suppressed dynamics feature for the source video data into the specific channel to combine the appearance and the appearance-suppressed dynamics feature for the source video data at time T.

The system of claim 2, wherein the dynamic channel embedding unit generates noise in the appearance-suppressed dynamics feature for the source video data by using dropout.

The system of claim 1, wherein the generation unit preserves the appearance of the target image data by using a U-Net structure.

The system of claim 4, wherein the generation unit generates a final video sequence based on the spatial and temporal consistency of the video sequence as determined by the discrimination unit.

The system of claim 1, wherein the discrimination unit comprises:
a spatial discrimination unit configured to discriminate frame accuracy between the source video data and the video sequence; and
a dynamic discrimination unit configured to discriminate dynamics consistency between the appearance-suppressed dynamics feature for the source video data and the appearance-suppressed dynamics feature for the video sequence.

The system of claim 6, wherein the spatial discrimination unit compares frames of the source video data with frames of the video sequence to discriminate whether a frame of the video sequence is a real frame or a generated (fake) frame.

The system of claim 6, wherein the dynamic discrimination unit compares the appearance-suppressed dynamics feature for the video sequence, obtained through the dynamic characteristic encoder, with the appearance-suppressed dynamics feature for the source video data to discriminate whether the appearance-suppressed dynamics feature for the video sequence represents real dynamics or generated (fake) dynamics.

The system of claim 8, wherein the dynamic discrimination unit discriminates dynamics consistency using only the appearance-suppressed dynamics feature for the video sequence up to time T, so as not to be affected by the variable length of the video sequence.

The system of claim 6, wherein the dynamic characteristic encoder replicates the first frame of the source video data or of the video sequence to generate a static sequence, obtains a latent spatiotemporal feature of the source video data or of the video sequence and a latent spatiotemporal feature of the static sequence by using a recurrent neural network (RNN), and removes the latent spatiotemporal feature of the static sequence from the latent spatiotemporal feature of the source video data or of the video sequence, thereby providing the appearance-suppressed dynamics feature for the source video data or the appearance-suppressed dynamics feature for the video sequence.

A video sequence generation system using a generative adversarial network, comprising:
a generator network that generates a video sequence of variable length using the dynamic characteristics of source video data and the appearance of target image data, and generates a final video sequence based on a discrimination result; and
a discriminator network that discriminates frame accuracy between the source video data and the video sequence, and discriminates dynamic characteristic consistency between the appearance-suppressed dynamic characteristic of the source video data and the appearance-suppressed dynamic characteristic of the video sequence,
wherein the generator network and the discriminator network are performed alternately.

The system of claim 11,
wherein the generator network comprises:
a dynamic channel embedding unit that inserts, into a specific channel, the appearance-suppressed dynamic characteristic of the source video data in which temporal dynamic characteristics have been encoded by a dynamic characteristic encoder; and
a generation unit that combines the appearance of the target image data with the embedded specific channel to generate the video sequence of variable length.
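The two generator components in the claim above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the linear embedding `w_embed`, the function names, the channel-first layout, and the use of one dynamic feature vector per time step are all hypothetical.

```python
import numpy as np

def embed_dynamic_channel(dyn_feature, w_embed, height, width):
    """Map a dynamic feature vector to one H x W channel (hypothetical linear map)."""
    return (w_embed @ dyn_feature).reshape(1, height, width)

def generator_input(target_image, dyn_channel):
    """Stack the target image channels with the embedded dynamic channel."""
    return np.concatenate([target_image, dyn_channel], axis=0)

rng = np.random.default_rng(0)
target = rng.random((3, 4, 4))        # C x H x W target image
w_embed = rng.normal(size=(16, 8))    # hypothetical embedding weights
# Assumption: one dynamic feature per time step, so the number of
# feature vectors determines the length of the generated sequence.
dyn_features = rng.normal(size=(6, 8))
inputs = np.stack([
    generator_input(target, embed_dynamic_channel(f, w_embed, 4, 4))
    for f in dyn_features
])
```

Each element of `inputs` carries the unchanged target appearance in its first three channels and the embedded dynamics in the fourth, which is the combined input a generation unit would then turn into one output frame.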

The system of claim 11,
wherein the discriminator network comprises:
a spatial discrimination unit that discriminates frame accuracy between the source video data and the video sequence; and
a dynamic discrimination unit that discriminates dynamic characteristic consistency between the appearance-suppressed dynamic characteristic of the source video data and the appearance-suppressed dynamic characteristic of the video sequence.

A video sequence generation method using a generative adversarial network, performed by a video sequence generation system, the method comprising:
inserting, into a specific channel, an appearance-suppressed dynamic characteristic of source video data in which temporal dynamic characteristics have been encoded by a dynamic characteristic encoder;
generating a variable-length video sequence by combining an appearance of target image data with the embedded specific channel;
discriminating frame accuracy between the source video data and the video sequence, and discriminating dynamic characteristic consistency between the appearance-suppressed dynamic characteristic of the source video data and an appearance-suppressed dynamic characteristic of the video sequence; and
outputting, based on the discrimination result, a final video sequence based on the spatial and temporal consistency of the video sequence.

The method of claim 14,
wherein the inserting into the specific channel comprises
inserting the appearance-suppressed dynamic characteristic of the source video data into the specific channel so as to combine the appearance with the appearance-suppressed dynamic characteristic of the source video data at time T.

The method of claim 14,
wherein the generating of the variable-length video sequence
preserves the appearance of the target image data by using a U-Net network structure.
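The reason a U-Net structure helps preserve appearance is its skip connections, which pass encoder inputs directly to the decoder. The sketch below shows that idea only: it replaces all learned convolution layers with fixed average pooling and nearest-neighbour upsampling, and the function names are hypothetical. It is not the network used in the invention.

```python
import numpy as np

def downsample(x):
    """2x2 average pooling on a C x H x W array (H and W must be even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Nearest-neighbour 2x upsampling on a C x H x W array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def unet_skip_block(x):
    """Encode, decode, then concatenate the skip connection from the input."""
    decoded = upsample(downsample(x))
    # The skip connection carries full-resolution detail past the bottleneck.
    return np.concatenate([x, decoded], axis=0)

image = np.arange(48, dtype=float).reshape(3, 4, 4)
out = unet_skip_block(image)
```

The first three channels of `out` are the input image unchanged, while the remaining channels hold the blurred version that went through the bottleneck. A later layer can therefore recover fine appearance detail that the downsampling path alone would lose.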

The method of claim 14,
wherein the discriminating of frame accuracy and dynamic characteristic consistency comprises
comparing a frame of the source video data with a frame of the video sequence to distinguish whether the frame of the video sequence is a real frame or a generated (fake) frame.

The method of claim 14,
wherein the discriminating of frame accuracy and dynamic characteristic consistency comprises
comparing the appearance-suppressed dynamic characteristic of the video sequence, obtained by the dynamic characteristic encoder, with the appearance-suppressed dynamic characteristic of the source video data, to distinguish whether the appearance-suppressed dynamic characteristic of the video sequence is a real dynamic characteristic or a generated (fake) dynamic characteristic.

A computer program stored in a computer-readable recording medium for performing the method of any one of claims 14 to 18.