KR102313160B1

KR102313160B1 - Method and apparatus for generating high-resolution video using motion information

Info

Publication number: KR102313160B1
Application number: KR1020200063777A
Authority: KR
Inventors: 임수철; 고대관; 이동한
Original assignee: 동국대학교 산학협력단
Priority date: 2019-05-31
Filing date: 2020-05-27
Publication date: 2021-10-19
Also published as: KR20200138668A

Abstract

Disclosed are a method and an apparatus for generating a high-resolution video using motion information. The video generation method may include: receiving a low-resolution transmission video based on a high-resolution original video; receiving motion information of a control device related to an interaction between objects included in the transmission video, and location information of the objects or of the control device; generating, using the transmission video, the motion information, and the location information, predicted images of a change in interaction force between the objects in frames that are included in the original video but missing from the transmission video; and generating a high-resolution output video corresponding to the original video by determining real images from among the predicted images.

Description

Method and apparatus for generating high-resolution video using motion information

The present invention relates to an apparatus and a method for generating a video and, more particularly, to a method and an apparatus for generating a high-resolution video from a low-resolution transmission video by using motion information.

To reduce the network load of a system, a technology has been developed in which a low-resolution video, obtained by extracting some frames from a high-resolution original video generated by a camera, is transmitted, and a receiving device generates from the received low-resolution video a high-resolution output video equivalent to the original video.

Conventional high-resolution video generation technology generates a high-resolution video by predicting the untransmitted frames using only the frames included in the low-resolution video, so the loss in the generated high-resolution video depends on the prediction accuracy. However, prediction accuracy is inversely proportional to the length of time to be predicted: as the time span to be predicted increases, the prediction accuracy decreases and the loss in the generated high-resolution video increases.

In addition, when a camera generates a video and transmits it to a receiving device, a delay arises from the transmission time, so the video received by the receiving device may have been captured by the camera a certain time earlier. That is, because the received video is a video of a previous time, the receiving device cannot display a video of the current time.

Accordingly, there is a demand for a method of generating a video of the current time and a high-resolution video using the received video.

The present invention may provide an apparatus and a method for generating a high-resolution output video from a low-resolution transmission video composed of some of the frames of a high-resolution original video, using motion information of a control device that controls an object, and location information of the objects or of the control device.

In addition, the present invention may provide an apparatus and a method that increase the accuracy of the predicted images derived from the low-resolution transmission video by using motion information and location information, which are smaller in data size than images, thereby raising the similarity between the high-resolution output video and the high-resolution original video captured by the camera while minimizing the increase in data required for video transmission.

Furthermore, the present invention may provide an apparatus and a method that output a video related to an object in real time by generating a video of the current time, which has been captured by the camera but not yet received, using a video of a previous time received with a delay caused by the transmission process together with motion information and location information received in real time.

A video generation method according to an embodiment of the present invention may include: receiving a low-resolution transmission video based on a high-resolution original video; receiving motion information of a control device related to an interaction between objects included in the transmission video, and location information of the objects or of the control device; generating, using the transmission video, the motion information, and the location information, predicted images of a change in interaction force between the objects in frames that are included in the original video but missing from the transmission video; and generating a high-resolution output video corresponding to the original video by determining real images from among the predicted images.

In the video generation method according to an embodiment of the present invention, the transmission video may be a video composed of some frames extracted from a high-resolution original video generated by capturing interactions between objects.

In the video generation method according to an embodiment of the present invention, the motion information may include at least one of a force applied by the control device to the object and a grip angle of the control device with respect to the object.

The video generation method according to an embodiment of the present invention may further include: comparing frames of the original video that are not included in the transmission video with the predicted images determined to be real images; and training a discrimination model according to the comparison result.

In the video generation method according to an embodiment of the present invention, the generating of the high-resolution output video may include classifying the predicted images as real images or fake images using a discrimination model, and generating the output video by combining the predicted images classified as real images.
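The classification-and-combination step described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the discriminator is represented only by a precomputed score per candidate, the 0.5 decision threshold is hypothetical, and interleaving the transmitted frames with the accepted predictions is an assumption about how the output video is ordered.

```python
from typing import Dict, List, Tuple

def assemble_output(
    transmitted: Dict[int, str],
    candidates: Dict[int, List[Tuple[str, float]]],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> List[str]:
    """Fill missing frame indices with predictions classified as real.

    transmitted maps frame index -> received frame.
    candidates maps a missing frame index -> (predicted frame, discriminator score).
    """
    output = dict(transmitted)
    for index, scored in candidates.items():
        # Keep only candidates the discriminator scores as real.
        real = [item for item in scored if item[1] >= threshold]
        if real:
            # Among the real candidates, keep the highest-scoring one.
            output[index] = max(real, key=lambda item: item[1])[0]
    # Frames are emitted in time order; indices with no real candidate are skipped.
    return [output[i] for i in sorted(output)]
```

Frames here are placeholder strings; in practice they would be image arrays, and a missing index with no accepted candidate would need a fallback that the text does not specify.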

In the video generation method according to an embodiment of the present invention, a real image may be a predicted image whose similarity to a frame of the original video that was not extracted into the transmission video is greater than or equal to a threshold, and a fake image may be a predicted image whose similarity to such a frame is less than the threshold.
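The threshold rule above can be sketched as follows. The text does not name a similarity measure, so the function below uses 1 / (1 + MSE) purely as a stand-in, and the threshold value is hypothetical. Because the rule compares against an original frame, it applies where that frame is available, for example when labeling training data.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # grayscale frame, rows of pixel values in [0, 1]

def similarity(a: Frame, b: Frame) -> float:
    """Stand-in similarity in (0, 1]: 1 / (1 + mean squared error)."""
    count = sum(len(row) for row in a)
    mse = sum(
        (x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)
    ) / count
    return 1.0 / (1.0 + mse)

def label_predictions(
    predicted: List[Frame], original: Frame, threshold: float
) -> List[str]:
    """Label each predicted image 'real' if similarity >= threshold, else 'fake'."""
    return [
        "real" if similarity(image, original) >= threshold else "fake"
        for image in predicted
    ]
```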

In the video generation method according to an embodiment of the present invention, the discrimination model may: generate test predicted images of a change in interaction force between the objects using test transmission videos, motion information of a control device related to an interaction between objects included in the test transmission videos, and location information of the objects or of the control device; learn real feature information for identifying real images among the test predicted images using the test transmission videos and verification information; and learn fake feature information for identifying fake images among the test predicted images using the test transmission videos and the predicted images.

In the video generation method according to an embodiment of the present invention, the discrimination model may: generate a first feature map by inputting the motion information and the location information into an LSTM-FC (Long Short-Term Memory - Fully Connected) network; generate a second feature map by merging, with the first feature map, time information indicating the difference between the time at which the test transmission videos were captured and the time at which they were acquired; generate a third feature map by encoding the test transmission videos; generate a fourth feature map by merging the third feature map with the second feature map; and generate the predicted images by decoding the fourth feature map.
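The order in which the four feature maps are produced can be traced with a small data-flow sketch. Everything below is illustrative: `lstm_fc_stub` and `encode_stub` are trivial stand-ins, not an LSTM-FC network or an image encoder, merging is assumed to be concatenation, feature maps are flattened to plain vectors, and the decoding step is omitted.

```python
from typing import List

Vector = List[float]

def lstm_fc_stub(sequence: List[Vector]) -> Vector:
    """Stand-in for the LSTM-FC network: averages each signal over time."""
    steps = len(sequence)
    return [sum(step[i] for step in sequence) / steps for i in range(len(sequence[0]))]

def encode_stub(frame: List[Vector]) -> Vector:
    """Stand-in for the image encoder: one mean value per pixel row."""
    return [sum(row) / len(row) for row in frame]

def merge(a: Vector, b: Vector) -> Vector:
    """Merging is assumed here to be concatenation."""
    return a + b

def build_fourth_feature_map(
    motion_and_location: List[Vector], time_offset: float, frame: List[Vector]
) -> Vector:
    first = lstm_fc_stub(motion_and_location)  # from motion and location information
    second = merge(first, [time_offset])       # first map + capture/acquisition time gap
    third = encode_stub(frame)                 # from the encoded transmission video
    fourth = merge(third, second)              # input to the decoder (not shown)
    return fourth
```

A real implementation would replace the stubs with trained networks and merge tensors along a channel axis; the sketch only fixes the sequence of operations named in the text.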

In the video generation method according to an embodiment of the present invention, the discrimination model may determine a mean squared error (MSE) loss function, a gradient difference loss (GDL) function, and an adversarial loss function using the test predicted images and verification information, and may determine the loss function of the discriminator, which learns the real feature information and the fake feature information, by combining the MSE loss function, the GDL loss function, and the adversarial loss function.
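One common way to combine these three terms is a weighted sum, sketched below in plain Python. The text does not give the weights or the exact form of each term, so the weights, the exponent of 1 in the gradient difference loss, and the use of binary cross-entropy against the "real" label for the adversarial term are all assumptions.

```python
import math
from typing import List

Image = List[List[float]]

def mse_loss(pred: Image, target: Image) -> float:
    """Mean of squared per-pixel differences."""
    count = sum(len(row) for row in target)
    return sum(
        (p - t) ** 2 for rp, rt in zip(pred, target) for p, t in zip(rp, rt)
    ) / count

def gdl_loss(pred: Image, target: Image) -> float:
    """Sum of differences between absolute neighbor gradients (exponent 1)."""
    total = 0.0
    for i in range(len(target)):
        for j in range(len(target[0])):
            if i > 0:  # vertical neighbor
                total += abs(
                    abs(target[i][j] - target[i - 1][j]) - abs(pred[i][j] - pred[i - 1][j])
                )
            if j > 0:  # horizontal neighbor
                total += abs(
                    abs(target[i][j] - target[i][j - 1]) - abs(pred[i][j] - pred[i][j - 1])
                )
    return total

def adversarial_loss(score_on_prediction: float) -> float:
    """Binary cross-entropy against the 'real' label for a score in (0, 1]."""
    return -math.log(max(score_on_prediction, 1e-12))

def combined_loss(
    pred: Image, target: Image, score_on_prediction: float,
    w_mse: float = 1.0, w_gdl: float = 1.0, w_adv: float = 1.0,  # hypothetical weights
) -> float:
    return (
        w_mse * mse_loss(pred, target)
        + w_gdl * gdl_loss(pred, target)
        + w_adv * adversarial_loss(score_on_prediction)
    )
```

The MSE term penalizes per-pixel error, the GDL term penalizes blurred edges, and the adversarial term rewards predictions the discriminator scores as real; how the terms are weighted in practice is a tuning decision not stated in the text.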

In the video generation method according to an embodiment of the present invention, the transmission video may be a video of a previous time, received at a time delayed by at least a certain amount from the time at which the high-resolution original video was captured, and the generating of the predicted images may include generating, using the motion information and the location information, predicted images of a change in interaction force between the objects at the current time for the frames missing from the transmission video and the frames included in the transmission video.

A real-time video generation method according to an embodiment of the present invention may include: receiving a video of a previous time owing to a delay in the transmission process; receiving current motion information of a control device related to an interaction between objects included in the video, and current location information of the objects or of the control device; generating predicted images of a change in interaction force between the objects at the current time using the video of the previous time, the current motion information, and the current location information; and generating a video of the current time by determining real images from among the predicted images.

In the real-time video generation method according to an embodiment of the present invention, the video of the previous time may be a video that the camera generated by capturing an interaction between objects at a previous time and that, having been delayed in the transmission process, is received at the current time.

In the real-time video generation method according to an embodiment of the present invention, the video of the previous time may be a low-resolution transmission video composed of some frames extracted from a high-resolution original video that the camera generated by capturing an interaction between objects at a previous time.

In the real-time video generation method according to an embodiment of the present invention, the generating of the predicted images may include generating predicted images of a change in interaction force between the objects at the current time for frames that are included in the original video but missing from the transmission video.

A video generation apparatus according to an embodiment of the present invention may include: an image receiving unit that receives a low-resolution transmission video based on a high-resolution original video; a motion information receiving unit that receives motion information of a control device related to an interaction between objects included in the transmission video, and location information of the objects or of the control device; a predicted image generation unit that generates, using the transmission video, the motion information, and the location information, predicted images of a change in interaction force between the objects in frames that are included in the original video but missing from the transmission video; and a video generation unit that generates a high-resolution output video corresponding to the original video by determining real images from among the predicted images.

According to an embodiment of the present invention, a high-resolution output video can be generated from a low-resolution transmission video composed of some of the frames of a high-resolution original video, using motion information of a control device that controls an object, and location information of the objects or of the control device.

In addition, according to an embodiment of the present invention, the accuracy of the predicted images derived from the low-resolution transmission video is increased by using motion information and location information, which are smaller in data size than images, so that the similarity between the high-resolution output video and the high-resolution original video captured by the camera can be raised while the increase in data required for video transmission is minimized.

Furthermore, according to an embodiment of the present invention, a video related to an object can be output in real time by generating a video of the current time, which has been captured by the camera but not yet received, using a video of a previous time received with a delay caused by the transmission process together with motion information and location information received in real time.

FIG. 1 is a diagram illustrating a video generation apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a process of generating a high-resolution output video from a low-resolution transmission video according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the operation of a video generation apparatus according to an embodiment of the present invention.
FIG. 4 is an example of a video generation apparatus according to an embodiment of the present invention.
FIG. 5 is a detailed view of the generator shown in FIG. 4.
FIG. 6 is a detailed view of the U-net shown in FIG. 5.
FIG. 7 is a detailed view of the discriminator shown in FIG. 4.
FIG. 8 is an example of a control device according to an embodiment of the present invention.
FIG. 9 is an example of a test set used to train a discrimination model according to an embodiment of the present invention.
FIG. 10 is a diagram illustrating a process of reconstructing a test set for a learning network according to an embodiment of the present invention.
FIG. 11 is an example of predicted images generated for each object, and of the original video, according to an embodiment of the present invention.
FIG. 12 is an example of a video generated according to an embodiment of the present invention, a video generated by an existing method, and the original video.
FIG. 13 is an example of RMSE, PSNR, and SSIM according to an embodiment of the present invention.
FIG. 14 is a flowchart illustrating a high-resolution video generation method according to an embodiment of the present invention.
FIG. 15 is a flowchart illustrating a real-time video generation method according to an embodiment of the present invention.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, various changes may be made to the embodiments, so the scope of rights of the patent application is not restricted or limited by these embodiments. All modifications, equivalents, and substitutes of the embodiments should be understood to fall within the scope of rights.

The terms used in the embodiments are for descriptive purposes only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In the description given with reference to the accompanying drawings, identical components are assigned identical reference numerals regardless of the figure number, and redundant descriptions of them are omitted. In describing the embodiments, detailed descriptions of related known technology are omitted where they are judged likely to obscure the gist of the embodiments unnecessarily.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a video generation apparatus according to an embodiment of the present invention.

As shown in FIG. 1, the video generation apparatus 100 may include an image receiving unit 110, a motion information receiving unit 120, a predicted image generation unit 130, a determination unit 140, and a video generation unit 150. The predicted image generation unit 130, the determination unit 140, and the video generation unit 150 may be separate processors, or may be individual modules of a program executed on a single processor.

The image receiving unit 110 may include a communication interface connected to the camera 101 by wire or wirelessly. The camera 101 may generate a high-resolution original video by capturing an interaction between objects. The camera 101 may then generate a low-resolution transmission video based on the high-resolution original video. For example, the transmission video may be composed of some of the frames of the high-resolution original video, extracted at a preset time interval. Next, the camera 101 may transmit the low-resolution transmission video to the image receiving unit 110.
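Extraction at a preset interval can be illustrated with a short sketch. It assumes the interval is expressed as a whole number of frames (`step`, a hypothetical parameter) and that extraction starts at the first frame; the text specifies neither.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def extract_transmission_frames(
    original_frames: Sequence[T], step: int
) -> Tuple[List[T], List[int]]:
    """Keep every `step`-th frame; report the indices left out.

    Returns (frames of the transmission video, indices of the missing frames).
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    kept = list(original_frames[::step])
    missing = [i for i in range(len(original_frames)) if i % step != 0]
    return kept, missing
```

The missing indices are the frames for which the receiving side would later have to generate predicted images.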

An interaction between objects may mean that a target object and another object, which may include the control device 102, come into contact so that the position of each object moves or the shape of at least one of the objects is deformed.

The image receiving unit 110 may pass the low-resolution transmission video received from the camera 101 through the communication interface to the predicted image generation unit 130. Depending on the state of the wired or wireless connection between the image receiving unit 110 and the camera 101, a certain time may elapse before the low-resolution transmission video sent by the camera 101 is received by the image receiving unit 110. Accordingly, the low-resolution transmission video may be received by the image receiving unit 110 a certain time after the camera 101 captured the high-resolution video. The transmission video received by the image receiving unit 110 may therefore be not a video of the current time but a video of a previous time, captured by the camera 101 a certain time before the current time.

The motion information receiving unit 120 may include a communication interface connected, by wire or wirelessly, to the control device 102 involved in the interaction between the objects. The control device 102 may be a device capable of controlling the position or state of the objects. For example, the control device 102 may be a robot arm that can grasp and move an object or apply force to an object. The control device 102 may transmit to the motion information receiving unit 120 motion information that includes at least one of: the force the robot arm applied to the objects to produce the interaction between them; and the range of movement of the joints the robot arm actuated to produce the interaction, together with the force applied to those joints. For example, the motion information may include at least one of the force that the control device 102, such as a robot arm, applies to an object and the grip angle of the control device 102 with respect to the object.

When the camera 101 captures a vehicle crash test, the control device 102 may be the vehicle and sensors associated with the vehicle. In that case, a sensor that measures the state of the vehicle's accelerator and brake may transmit motion information including the force applied to the accelerator or the brake to the motion information receiving unit 120.

In addition, the control device 102, or a position sensor mounted on the exterior of the control device 102, may transmit to the motion information receiving unit 120 location information including at least one of the positions of the objects as changed by the interaction between them and the position of the control device 102.

The motion information receiving unit 120 may pass the motion information and the location information received from the control device 102 through the communication interface to the predicted image generation unit 130.

The predicted image generation unit 130 may generate predicted images of a change in interaction force between the objects using the low-resolution transmission video passed from the image receiving unit 110 and the motion information and location information received from the motion information receiving unit 120. Specifically, the predicted image generation unit 130 may generate predicted images of the change in interaction force between the objects in frames that are included in the original video but missing from the transmission video.

More specifically, the predicted image generation unit 130 may use the motion information and location information for the time corresponding to a missing frame to determine the motion of the control device 102 and the position of the object in real time. By applying the determined motion of the control device 102 and the position of the object to the process of generating predicted images for the missing frame, the predicted image generation unit 130 can generate a predicted image that reflects a change in the position or shape of the object even when that change occurs in a direction or with a magnitude that cannot be predicted from the frames included in the transmission video.
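Selecting the motion and location information "for the time corresponding to a missing frame" can be sketched as a nearest-timestamp lookup. The record layout (`ControlSample` and its fields) and the nearest-neighbor rule are assumptions made for illustration; the text does not say how samples are matched to frames.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class ControlSample:
    """Hypothetical record of one motion/location reading."""
    timestamp: float               # seconds
    force: float                   # force applied to the object
    grip_angle: float              # grip angle of the control device
    position: Tuple[float, float]  # position of the object or control device

def nearest_sample(samples: List[ControlSample], frame_time: float) -> ControlSample:
    """Return the sample closest in time to a missing frame.

    `samples` must be non-empty and sorted by timestamp.
    """
    if not samples:
        raise ValueError("no samples available")
    times = [s.timestamp for s in samples]
    i = bisect_left(times, frame_time)
    if i == 0:
        return samples[0]
    if i == len(samples):
        return samples[-1]
    before, after = samples[i - 1], samples[i]
    # Ties go to the earlier sample.
    if frame_time - before.timestamp <= after.timestamp - frame_time:
        return before
    return after
```

Interpolating between the two neighboring samples would be an equally plausible choice when the sensor rate is low relative to the frame rate.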

The predicted image generation unit 130 may also generate predicted images of a change in interaction force between the objects at the current time using a video of a previous time together with motion information and location information of the current time. The video of the previous time may be an original video captured by the camera 101 at a previous time, or a transmission video generated by the camera 101 at a previous time. The video of the previous time may have been received at the current time because of a delay in the transmission process.

Because a video is larger than the motion information and location information, a delay may occur between the time at which the camera 101 transmits the video and the time at which the video generating apparatus 100 receives it, depending on the size of the video and the performance of the network carrying it. The larger the video, the longer its transmission takes, so the delay may also increase. The camera 101 may therefore extract some of the frames of the high-resolution original video to generate a low-resolution transmission video that is smaller than the original video, and transmit it to the video generating apparatus 100, thereby minimizing the delay between the time at which the camera 101 captures the video and the time at which the video generating apparatus 100 receives it.

However, although the transmission video is smaller than the original video, it still consists of a plurality of images, so it may be larger than the motion information, which consists of values such as the force applied to an object and the angle of the object, and the location information, which indicates the location of the object. Accordingly, although the transmission video is delayed less than the original video would be, it may still incur a certain delay, unlike the motion information and location information, which can be received in real time. The video generating apparatus 100 may therefore be unable to receive the video captured by the camera 101 in real time. That is, the transmission video received by the image receiving unit 110 may be a transmission video of a previous time, received at least a certain time after the camera 101 captured the high-resolution original video.

In this case, the prediction image generating unit 130 may generate prediction images of the change in the interaction force between objects at the current time by using the video of the previous time received at the current time together with the current motion information and location information. The video generating unit 150 may then generate a video of the current time from the prediction images, thereby outputting in real time an output video that has the same resolution as the original video captured by the camera 101 and is as similar to it as possible.

The determining unit 140 may use a discrimination model to identify real images among the prediction images generated by the prediction image generating unit 130. In this case, the determining unit 140 may use the discrimination model to classify each prediction image as a real image or a fake image. The discrimination model may be trained using a test transmission video and verification information. Here, a real image is a prediction image whose similarity to a frame of the original video that was not extracted into the transmission video is equal to or greater than a threshold, and a fake image is a prediction image whose similarity to such a frame is less than the threshold.
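The real/fake definition above can be illustrated with a short Python sketch. It shows only the threshold rule, not the learned discrimination model itself. The similarity measure (one minus the mean absolute pixel difference), the function names, and the flattened-list image representation are all hypothetical choices made for illustration; the description does not specify them.

```python
from typing import List, Sequence

Image = Sequence[float]  # assumption: flattened pixel values in [0, 1]


def similarity(a: Image, b: Image) -> float:
    """Hypothetical similarity: 1 minus the mean absolute pixel difference."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and of equal size")
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def label_predictions(predictions: Sequence[Image],
                      reference: Image,
                      threshold: float) -> List[str]:
    """Label a prediction 'real' if its similarity to the reference frame
    (a frame of the original video that was not transmitted) is at least
    the threshold, and 'fake' otherwise."""
    return ["real" if similarity(p, reference) >= threshold else "fake"
            for p in predictions]


if __name__ == "__main__":
    missing_frame = [0.5, 0.6]
    candidates = [[0.5, 0.5], [0.0, 1.0]]
    print(label_predictions(candidates, missing_frame, threshold=0.9))
```

In the apparatus itself this decision would be made by the trained discrimination model, which has no access to the missing frame at inference time; the sketch only makes the labeling criterion concrete.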

In addition, the prediction image generating unit 130 may generate prediction images of the change in the interaction force between objects at time t by using the video of time t-1, the motion information of time t, and the location information of time t, and the determining unit 140 may identify real images among those prediction images. In this case, the image receiving unit 110 may receive the video of time t at time t+1.

The determining unit 140 may then train the discrimination model by comparing the prediction image that was determined to be a real image, among the prediction images of the change in the interaction force between objects at time t, with the video of time t. For example, when the video of time t differs from the prediction image determined to be a real image, the determining unit 140 may train the discrimination model so that this prediction image is determined to be a fake image.

The model learning apparatus 103 may train a prediction model and deliver it to the prediction image generating unit 130, and may train a discrimination model and deliver it to the determining unit 140. In this case, the model learning apparatus 103 may include an image receiving unit, a motion information receiving unit, a prediction image generating unit, and a determining unit.

The image receiving unit of the model learning apparatus 103 may receive test transmission videos or accept them as input.

The motion information receiving unit of the model learning apparatus 103 may receive, or accept as input, motion information of an object control device related to the interaction between objects included in the test transmission videos, and location information of the objects or of the object control device.

The prediction image generating unit of the model learning apparatus 103 may generate prediction images of the change in the interaction force between objects by using a test transmission video, the motion information, and the location information.

The determining unit of the model learning apparatus 103 may learn real feature information for identifying real images among the prediction images by using the test transmission video and the verification information, and may learn fake feature information for identifying fake images among the prediction images by using the test transmission video and the prediction images.

Alternatively, the image receiving unit 110, the motion information receiving unit 120, the prediction image generating unit 130, and the determining unit 140 of the video generating apparatus 100 may perform the operations of the image receiving unit, the motion information receiving unit, the prediction image generating unit, and the determining unit of the model learning apparatus 103 to train the discrimination model.

In this case, the prediction image generating unit 130 may train the prediction model according to the prediction images determined to be real images. For example, the prediction image generating unit 130 may assign weights to the features used to generate the prediction images that were determined to be real images, so that more of the generated prediction images are determined to be real images.

The video generating unit 150 may use the prediction images determined to be real images to generate a high-resolution output video having the same resolution as the original video captured by the camera 101.

The video generating apparatus 100 may generate a high-resolution output video from a low-resolution transmission video, which consists of some of the frames of the high-resolution original video, by using motion information of a control device that controls an object and location information of the objects or of the control device.

In this case, the video generating apparatus 100 uses motion information and location information, which are smaller than images, to increase the accuracy of the prediction images derived from the low-resolution transmission video. This raises the similarity between the high-resolution output video and the high-resolution original video captured by the camera while minimizing the increase in data required for video transmission.

In addition, the video generating apparatus 100 may use the video of a previous time, received late because of transmission delay, together with motion information and location information received in real time, to generate the video of the current time, which has been captured by the camera but not yet received. The video related to the object can thereby be output in real time.

FIG. 2 is a diagram illustrating a process of generating a high-resolution output video from a low-resolution transmission video according to an embodiment of the present invention.

As shown in FIG. 2, the slave 210, which includes a camera and a control device (a robot arm), may generate a high-resolution original video 211, together with motion information (Force) of the robot arm and location information (Pos: position) 212 of the robot arm. For example, the original video may be defined as I^GT (ground truth image). The motion information may include the force that a control device such as the robot arm applies to an object (S^HR_f: robot grip force) and the angle at which the robot arm grips the object (S^HR_g: robot grip angle).

Because the transmission delay of a video grows with its size, the slave 210 may reduce the size of the transmitted video by extracting some of the frames of I^GT to generate the low-resolution transmission video 221. The transmission video 221 may be defined as I^LR (low-update-rate image). For example, the slave 210 may extract one frame out of every six frames of the original video 211 (I^GT) to generate the transmission video 221 (I^LR).
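The one-in-six frame extraction can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes that the first frame of each group of six is the one kept, which the description does not state.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def extract_transmission_frames(original_frames: Sequence[Frame],
                                step: int = 6) -> List[Frame]:
    """Keep one frame out of every `step` frames of the original video (I^GT)
    to form the low-update-rate transmission video (I^LR)."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return list(original_frames[::step])


def missing_frame_indices(n_frames: int, step: int = 6) -> List[int]:
    """Indices of original frames that are not transmitted and therefore
    have to be predicted on the receiving side."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return [i for i in range(n_frames) if i % step != 0]


if __name__ == "__main__":
    frames = list(range(12))                    # stand-ins for 12 original frames
    print(extract_transmission_frames(frames))  # transmitted frames
    print(missing_frame_indices(6))             # frames to predict in one group
```

With a step of six, each group of six original frames contributes one transmitted frame and five frames that must be predicted.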

The slave 210 may then transmit the transmission video 221 (I^LR), together with the motion information (Force) and location information (Pos: position) 212 of the robot arm, to the master 230 through a network such as the Internet.

As shown in FIG. 2, the master 230, which includes the video generating apparatus 100, may use the received transmission video 221 (I^LR) and the motion information (Force) and location information (Pos: position) 212 of the robot arm to generate prediction images corresponding to the five frames of the original video 211 (I^GT) that were not received, thereby generating an output video 231, a high-resolution video having the same resolution as the original video 211 (I^GT).

FIG. 3 is a diagram illustrating the operation of a video generating apparatus according to an embodiment of the present invention.

As shown in FIG. 3, the video generating apparatus 100 may receive the transmission video 320 from the camera 101. Because it takes time for the transmission video 320 sent by the camera 101 to reach the video generating apparatus 100, the transmission video 320 received by the video generating apparatus 100 may be an image captured by the camera 101 at a previous time.

For example, at time 0 s the camera 101 may capture the robot arm and the object to generate the original video 310, extract some of its frames to generate the transmission video 320 (t=0s), and transmit the generated transmission video 320 (t=0s) to the video generating apparatus 100.

At time 1 s, the video generating apparatus 100 has received the transmission video 320 (t=0s), while the camera 101 may already have generated the transmission video (t=1s). If the robot arm is moving, the robot arm may be farther from the object in the transmission video (t=1s) than in the transmission video 320 (t=0s), as shown in FIG. 3.

That is, at time 1 s the transmission video 320 (t=0s) received by the video generating apparatus 100 differs from the transmission video (t=1s) that the camera 101 has generated by capturing the robot arm, so a user who controls the robot arm with reference to the transmission video 320 (t=0s) may cause a malfunction.

Accordingly, the video generating apparatus 100 may use the received transmission video 320 (t=0s) together with the motion information (force) and location information (position) of the robot arm at the current time to generate and display a prediction image 324 whose similarity to the transmission video (t=1s) is equal to or greater than a threshold. In this case, the motion information may include the angle 322 (S^HR_g) through which each joint of the robot arm moves and the force 321 (S^HR_f) that the control device applies to the object. The video generating apparatus 100 may also receive time information 323 indicating how long the transmission video was delayed in transmission. The time information 323 may be a delay time (T^D: delay time) indicating the interval between the current time and the previous time at which the most recently received transmission video was captured. For example, the delay time in FIG. 3 may be 1 second.
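As a small illustration of the delay time T^D, the following Python sketch computes it from two timestamps and bundles it with the current sensor values. The class and function names, the field layout, and the use of seconds as the unit are assumptions made for illustration; the description only states that T^D is the interval between the current time and the capture time of the most recently received transmission video.

```python
from dataclasses import dataclass


@dataclass
class GeneratorInput:
    """Hypothetical bundle of what is available at the current time."""
    last_capture_s: float  # capture time of the last received transmission frame
    grip_force: float      # S^HR_f at the current time
    grip_angle: float      # S^HR_g at the current time
    delay_s: float         # T^D


def delay_time(current_s: float, last_capture_s: float) -> float:
    """T^D: interval between now and the capture of the last received frame."""
    if current_s < last_capture_s:
        raise ValueError("the last received frame cannot be from the future")
    return current_s - last_capture_s


def build_input(current_s: float, last_capture_s: float,
                grip_force: float, grip_angle: float) -> GeneratorInput:
    """Collect the current sensor values and the delay for one prediction."""
    return GeneratorInput(last_capture_s, grip_force, grip_angle,
                          delay_time(current_s, last_capture_s))


if __name__ == "__main__":
    # FIG. 3 scenario: frame captured at 0 s, prediction made at 1 s.
    print(build_input(1.0, 0.0, grip_force=2.5, grip_angle=30.0))
```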

The video generating apparatus 100 may then combine the transmission video 320 with those of the prediction images 324 that were determined to be real images, to generate a high-resolution output video 33 corresponding to the original video 310.

FIG. 4 is an example of a video generating apparatus according to an embodiment of the present invention.

The video generating apparatus 100 may include a generator 410 and a discriminator 420. The generator 410 is an example of the prediction image generating unit 130, and the discriminator 420 is an example of the determining unit 140.

As shown in FIG. 4, the generator 410 may generate a prediction image 411 (I^HR) by using the transmission video 401 (I^LR) received from the camera 101, the motion information received from the control device 102, and the time information 404. The motion information may include the force 402 (S^HR_f) that the control device (the robot arm) applies to the object and the angle 403 (S^HR_g) at which the robot arm grips the object. The force 402 (S^HR_f) may be an interaction force determined by measuring the force acting between the robot arm and the object. The angle 403 (S^HR_g) indicates the position of the robot arm, and may include not only the angle but also the position at which the robot arm grips the object, or the position of the object as measured by the robot arm in contact with it.

Specifically, the generator 410 may generate a prediction image 411 (I^HR: high-update-rate image) corresponding to the transmission video of the current time by using the transmission video 401 (I^LR) of the previous time, the motion information of the current time, and the time information 404 indicating the difference between the previous time and the current time.

The generator 410 may include an LSTM-FC (Long Short-Term Memory, fully connected) network that processes the motion information and the time information 404, and a U-Net that generates the prediction image 411 (I^HR) from the output of the LSTM-FC network and the transmission video 401.

In this case, the discriminator 420 may include a discriminator 421 that learns a fake loss for identifying fake images by using a fake pair 405, in which the prediction image 411 (I^HR) is matched with the transmission video 401 (I^LR), and a discriminator 422 that learns a real loss for identifying real images by using a real pair 406, in which the verification information 413 is matched with the transmission video 401 (I^LR). For example, the discriminator 421 and the discriminator 422 may be autoencoders. The verification information 413 may be the transmission video of the current time or the original video (I^GT). The fake loss may be the loss of the autoencoder (discriminator 421) on the fake pair, and the real loss may be the loss of the autoencoder (discriminator 422) on the real pair.

In this case, to train the discriminator 420, the video generating apparatus 100 may receive a test transmission video corresponding to the transmission video 401 generated at the previous time, and verification information 413 corresponding to the transmission video of the current time.

In addition, so that the discriminator 420 can readily learn the fake loss, the generator 410 may assign arbitrary weights to the motion information or the time information 404 to deliberately generate a prediction image 411 that differs from the transmission video 401 (I^LR). The discriminator 421 may then learn the fake loss by using the prediction image generated to differ from the transmission video 401 (I^LR).

When the discriminator 420 has finished learning the fake loss, the generator 410 may repeat the process of generating prediction images that are more similar to the transmission video 401 (I^LR) than the previously generated prediction images. The discriminator 421 may learn the fake loss from the prediction images repeatedly generated by the generator 410 and the transmission video, thereby becoming more likely to judge a prediction image that differs from the transmission video of the current time to be a fake image.

In addition, the discriminator 422 may learn the real loss according to the difference between the verification information 413 and the transmission video 401 (I^LR), thereby becoming more likely to judge a prediction image whose similarity to the transmission video of the current time is equal to or greater than the threshold to be a real image.

In addition, the discriminator 420 may learn how to generate the prediction image 411 (I^HR) with the verification information 413 as the target. In this case, the discriminator 420 may learn loss functions with the goal of optimizing the Wasserstein distance. For example, the discriminator 420 may learn at least one of an L1 loss function (L_L1: L1 loss), an image gradient difference loss (GDL) function, and an adversarial loss function. The adversarial loss function may include the fake loss and the real loss.

In addition, the generator 410 may combine the loss functions learned by the discriminator 420 and apply them to the prediction model. For example, the generator 410 may define the generator loss used to train the prediction model as in Equation 1.
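One plausible reading of the generator loss, given the three loss terms and the three loss coefficients named in this section, is a weighted sum. The following is a sketch under that assumption and should not be taken as the exact form of Equation 1. The symbols L_G and L_adv, and the L1-norm form of L_L1 with the original video as the verification information, are hypothetical notation introduced here for illustration.

```latex
L_{G} = \lambda_{L1}\, L_{L1} + \lambda_{GDL}\, L_{GDL} + \lambda_{adv}\, L_{adv},
\qquad
L_{L1} = \lVert I^{GT} - I^{HR} \rVert_{1}
```

Under this reading, the coefficients set how strongly pixel-wise accuracy, edge sharpness, and the adversarial term each influence the training of the prediction model.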

For example, the loss coefficient λ_L1 may be 0.8, λ_GDL may be 1, and λ_adv may be 0.2. The loss coefficients are determined empirically from a plurality of samples and may take other values depending on the embodiment or the samples. The L1 loss function (L_L1) may be determined by the difference between the verification information 413 and the prediction image 411 (I^HR). L_GDL is the image gradient difference loss (GDL) function and may be used to minimize the loss at the boundaries between the prediction image 411 (I^HR) and the original video (I^GT) serving as the verification information 413. For example, L_GDL may be defined as in Equation 2.

Here, i and j may be pixel positions in the prediction image 411 (I^HR) and the original video (I^GT). The constant in Equation 2 is preset and may be 2 or another integer. The GDL loss function emphasizes the edges of an image so that a sharper image is generated.
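To make the role of the pixel indices i, j and the preset constant concrete, the following Python sketch implements a commonly used formulation of the gradient difference loss, in which absolute differences between neighboring pixels are compared between the two images and raised to a power alpha. It is an assumption that Equation 2 takes this form and that the preset constant is this exponent. The function name and the nested-list grayscale image representation are hypothetical.

```python
from typing import Sequence

Gray = Sequence[Sequence[float]]  # assumption: 2-D grayscale image, rows of pixels


def gdl_loss(pred: Gray, target: Gray, alpha: int = 2) -> float:
    """Gradient difference loss between a predicted image and a target image.

    For every pixel (i, j), the vertical and horizontal neighbor differences
    of the target and of the prediction are compared, and the absolute gap
    is raised to the power alpha and accumulated.
    """
    h, w = len(pred), len(pred[0])
    if len(target) != h or any(len(r) != w for r in pred) \
            or any(len(r) != w for r in target):
        raise ValueError("images must have the same rectangular shape")
    total = 0.0
    for i in range(h):
        for j in range(w):
            if i > 0:  # vertical gradient
                g_pred = abs(pred[i][j] - pred[i - 1][j])
                g_true = abs(target[i][j] - target[i - 1][j])
                total += abs(g_true - g_pred) ** alpha
            if j > 0:  # horizontal gradient
                g_pred = abs(pred[i][j] - pred[i][j - 1])
                g_true = abs(target[i][j] - target[i][j - 1])
                total += abs(g_true - g_pred) ** alpha
    return total


if __name__ == "__main__":
    flat = [[0.0, 0.0], [0.0, 0.0]]   # prediction with no edge
    edge = [[0.0, 1.0], [0.0, 1.0]]   # target with a vertical edge
    print(gdl_loss(flat, edge))
```

Because the loss compares gradients rather than raw pixel values, a prediction that blurs an edge is penalized even when its average brightness matches the target.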

The fake loss may be defined as in Equation 3.

Here, RI^fake may be the output of the discriminator 421 given the prediction image 411 (I^HR) and the transmission video 401 (I^LR) as input.

The real loss may be defined as in Equation 4.

Here, RI^real may be the output of the discriminator 422 given the transmission video 401 (I^LR) and the verification information 413 as input. The verification information 413 may be the original video (I^GT).

The discriminator 420 may be trained to minimize the real loss and maximize the fake loss. For example, the discriminator loss of the discriminator 420 may be defined as in Equation 5.

Here, K_t may be a hyperparameter for the update rate that controls the equilibrium between the real loss and the fake loss. For example, the initial value of K_t is 0, and K_t may be determined according to Equation 6.

K_t may take a value between 0 and 1. Two further hyperparameters may be used to update K_t until the discriminator 420 is updated. For example, they may be 0.005 and 0.7, respectively, although they may be set differently depending on the embodiment.
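The interplay of the real loss, the fake loss, and K_t resembles the equilibrium scheme used for autoencoder-based adversarial training, so the following Python sketch follows that scheme. It is an assumption, not a reproduction of Equations 5 and 6: the discriminator loss is taken as the real loss minus K_t times the fake loss, 0.005 is assumed to be a step size (called `lambda_k` here), and 0.7 is assumed to be a balance ratio (called `gamma` here). All names are hypothetical.

```python
def discriminator_loss(real_loss: float, fake_loss: float, k_t: float) -> float:
    """Assumed discriminator objective: minimize the real loss while
    maximizing the fake loss, with the fake term weighted by K_t."""
    return real_loss - k_t * fake_loss


def update_k(k_t: float, real_loss: float, fake_loss: float,
             lambda_k: float = 0.005, gamma: float = 0.7) -> float:
    """Assumed update of K_t, clamped to the range [0, 1].

    K_t grows when the fake loss falls below gamma times the real loss,
    and shrinks otherwise, which keeps the two losses in balance.
    """
    k_next = k_t + lambda_k * (gamma * real_loss - fake_loss)
    return min(1.0, max(0.0, k_next))


if __name__ == "__main__":
    k = 0.0  # initial value of K_t
    for real, fake in [(1.0, 0.2), (0.9, 0.3), (0.8, 0.7)]:
        print(discriminator_loss(real, fake, k), k)
        k = update_k(k, real, fake)
```

Starting K_t at 0 means the discriminator first concentrates on reconstructing real pairs; the fake term gains weight only as training proceeds.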

FIG. 5 is a detailed view of the generator shown in FIG. 4.

As shown in FIG. 5, the generator 410 may include a U-Net 510 and an LSTM-FC network 520.

The LSTM-FC network 520 may be used to map the motion information, which is time-series sensor data, and the time information 404 to the feature map needed to generate the prediction image 411. The motion information may include, in a 120x2 format, the force 402 (S^HR_f) that the robot arm applies to the object and the angle 403 (S^HR_g) at which the robot arm grips the object. The time information is a value representing the time difference between the transmission video 401 (I^LR) and the prediction image 411, and may take the form of a 1x1 feature.

As shown in FIG. 5, the LSTM-FC network 520 consists of two LSTM layers followed by one fully connected layer. The LSTM-FC network 520 may output a feature map in a 4x4x64 format to the U-Net 510.

The two LSTM layers may have 64 and 128 weights, respectively. They may output features in a 1x128 format in time order, following the motion information 402 received in time order. Both LSTM layers may use tanh as the activation function.

The fully connected layer may receive a 1x129 feature obtained by merging the last 1x128 feature output by the two LSTM layers with the time information 404, and may output a 1x1024 feature from it.

The output of the fully connected layer may be a feature vector of the force 402 (S^HR_f) that the robot arm applies to the object and the angle 403 (S^HR_g) at which the robot arm grips the object. The LSTM-FC network 520 may reshape the 1x1024 feature output by the fully connected layer into a 4x4x64 feature map, matching the format of the output of the encoder of the U-Net 510, and pass it to the U-Net 510 so that the output of the fully connected layer can be merged with the output of the encoder.
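The tensor formats along the LSTM-FC path can be checked with a small shape trace. This is a bookkeeping sketch only, not a neural network: the function name and stage labels are hypothetical, and it assumes that the last LSTM layer outputs 128 features per step, so that appending the 1x1 time feature yields the 1x129 input of the fully connected layer.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def lstm_fc_shapes(seq_len: int = 120,
                   n_signals: int = 2,
                   lstm_units: Tuple[int, int] = (64, 128),
                   time_features: int = 1,
                   fc_units: int = 1024,
                   out_map: Shape = (4, 4, 64)) -> List[Tuple[str, Shape]]:
    """Trace the data formats from the sensor input to the reshaped feature map."""
    shapes: List[Tuple[str, Shape]] = [("sensor input", (seq_len, n_signals))]
    for idx, units in enumerate(lstm_units, start=1):
        shapes.append((f"LSTM layer {idx}, per step", (1, units)))
    merged = lstm_units[-1] + time_features  # last LSTM feature + time feature
    shapes.append(("merged with time information", (1, merged)))
    shapes.append(("fully connected output", (1, fc_units)))
    h, w, c = out_map
    if h * w * c != fc_units:
        raise ValueError("the feature map must hold exactly fc_units values")
    shapes.append(("reshaped feature map", out_map))
    return shapes


if __name__ == "__main__":
    for name, shape in lstm_fc_shapes():
        print(f"{name}: {shape}")
```

The reshape is lossless because 4 x 4 x 64 equals 1024, which is why the fully connected layer is sized to 1024 outputs.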

The U net 510 may be built as an encoder-decoder network with added skip connections.

The U net 510 may receive the transmission video 401 in a 128x128x3 format and the 4x4x64 feature map output from the LSTM-FC network 520, and may generate the prediction image 411 in a 128x128x3 format.

The encoder of the U net 510 may be composed of five encoder blocks arranged in a row, as shown in FIG. 5. The number n (42, 48, 64, 96, 128) indicated on each encoder block may be the number of channels of the feature output from that block. The encoder 610 may encode the transmission video 401 in the 128x128x3 format and output a feature map in a 4x4x128 format.

The decoder 620 of the U net 510 may be composed of five decoder blocks, as shown in FIG. 5. The number n (96, 64, 48, 32, 3) indicated on each decoder block may be the number of channels of the feature output from that block. The decoder 620 may receive a 4x4x192 feature map obtained by merging the 4x4x128 feature map output from the encoder 610 with the 4x4x64 feature map output from the LSTM-FC network 520. The decoder 620 may then decode the received feature map and output the prediction image 411 in the 128x128x3 format.
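The encoder and decoder dimensions can be traced with a short sketch. It assumes that every encoder block halves the height and width and every decoder block doubles them, and it takes the channel counts as listed in the description. The function names are hypothetical.

```python
# Trace feature-map shapes through the U net (illustrative sketch).

ENCODER_CHANNELS = [42, 48, 64, 96, 128]  # channel counts as listed in the description
DECODER_CHANNELS = [96, 64, 48, 32, 3]
LSTM_FC_CHANNELS = 64                     # channels of the LSTM-FC feature map


def encoder_shapes(height, width, channels=ENCODER_CHANNELS):
    """Each encoder block halves height and width and outputs n channels."""
    shapes = []
    for n in channels:
        height, width = height // 2, width // 2
        shapes.append((height, width, n))
    return shapes


def decoder_shapes(height, width, channels=DECODER_CHANNELS):
    """Each decoder block doubles height and width and outputs n channels."""
    shapes = []
    for n in channels:
        height, width = height * 2, width * 2
        shapes.append((height, width, n))
    return shapes


enc = encoder_shapes(128, 128)
h, w, c = enc[-1]
bottleneck = (h, w, c + LSTM_FC_CHANNELS)  # encoder output merged with LSTM-FC map
dec = decoder_shapes(h, w)
print(enc[-1], bottleneck, dec[-1])
```

Five halvings take 128 down to 4, and five doublings bring it back to 128, so the block counts on both sides have to match for the output to have the input resolution.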

In this case, as shown in FIG. 5, every encoder block of the encoder 610 except the last one may deliver its output feature map, through skip connections, to the decoder block whose input size is the same. For example, the fourth encoder block may output a 96-channel feature, and the second decoder block may receive the 96-channel feature output from the first decoder block. Accordingly, as shown in FIG. 5, the fourth encoder block may deliver its output feature map to the second decoder block, which is marked with 64 at the top.

Accordingly, every decoder block except the first may receive an input feature map in which the output feature map of the previous decoder block is merged with the output feature map of the encoder block delivered through the skip connections.
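The pairing of encoder and decoder blocks can be derived from spatial sizes alone. The sketch below assumes a 128-pixel input, five blocks on each side, halving in each encoder block, and doubling in each decoder block. The function name `skip_pairs` is hypothetical.

```python
# Match encoder outputs to decoder inputs by spatial size (illustrative sketch).

def skip_pairs(input_size=128, num_blocks=5):
    """Return {decoder_block: encoder_block} (1-based) for the skip connections."""
    # Spatial size of each encoder block output: 64, 32, 16, 8, 4.
    enc_out = {j: input_size // 2 ** j for j in range(1, num_blocks + 1)}
    bottleneck = enc_out[num_blocks]
    # Spatial size of each decoder block input: 4, 8, 16, 32, 64.
    dec_in = {k: bottleneck * 2 ** (k - 1) for k in range(1, num_blocks + 1)}
    pairs = {}
    for k in range(2, num_blocks + 1):      # the first decoder block has no skip input
        for j in range(1, num_blocks):      # the last encoder block sends no skip output
            if enc_out[j] == dec_in[k]:
                pairs[k] = j
    return pairs


print(skip_pairs())
```

Under these assumptions the second decoder block is paired with the fourth encoder block, which agrees with the example in the description.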

FIG. 6 is a detailed view of the U-net shown in FIG. 5.

As shown in FIG. 6, each of the encoder blocks 610 included in the encoder may receive an input feature map in an H (height) x W (width) x C (channel) format and output an H/2 x W/2 x n output feature map, whose spatial size is reduced to one quarter of the input.

In this case, the encoder block may be composed of a convolution (Conv) layer, a Concat layer, and two modules. As shown in FIG. 6, each module may have a structure in which Conv (convolution), BN (batch normalization), and Leaky ReLU layers are connected in sequence.

As shown in FIG. 6, the encoder block 610 may pass the input feature map it received to the point just before its output through a ResNet-style skip connection, so that the feature map encoded in the encoder block 610 and the input feature map received by the encoder block 610 are merged and output.

Each of the decoder blocks 620 included in the decoder may receive a feature map in an H (height) x W (width) x C (channel) format and output a feature map in a 2H x 2W x n format.

In this case, each of the decoder blocks 620 may be composed of a convolution layer and two modules. As shown in FIG. 6, the first module may have a structure in which a transpose convolution layer, a BN (batch normalization) layer, and a Leaky ReLU layer are connected in sequence. The second module may have a structure in which Conv (convolution), BN (batch normalization), and Leaky ReLU layers are connected in sequence.

In addition, the last layer of the decoder block 620 may be a convolution layer having a 1x1 kernel size and a stride of 1. The Concat layer placed in front of the decoder block 620 in FIG. 6 may concatenate the output feature of the previous decoder block with the output feature of the encoder block having the same pixel resolution, delivered through the U-net skip connection.

FIG. 7 is a detailed view of the discriminator shown in FIG. 4.

The discriminant model of the discriminator 420 may be designed based on conditional BEGAN, as shown in FIG. 7. Because conditional BEGAN computes its loss function based on the Wasserstein distance, the discriminator 420 may be built with an auto-encoder architecture.
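The description does not spell out the loss, so the following is only a sketch of the loss bookkeeping from the original (unconditional) BEGAN formulation, on the assumption that the conditional variant keeps the same form. The discriminator is scored by how well it reconstructs its input, and a balance term k trades off the real and fake reconstruction losses. The hyperparameter values `gamma=0.5` and `lambda_k=0.001` and the function names are assumptions, not values taken from the description.

```python
# BEGAN-style loss bookkeeping (illustrative; hyperparameter values are assumptions).

def l1_reconstruction_loss(image, reconstruction):
    """Mean absolute error between an image and its auto-encoder reconstruction."""
    return sum(abs(a - b) for a, b in zip(image, reconstruction)) / len(image)


def began_step(loss_real, loss_fake, k_t, gamma=0.5, lambda_k=0.001):
    """Return (discriminator loss, generator loss, updated k, convergence measure)."""
    d_loss = loss_real - k_t * loss_fake          # discriminator objective
    g_loss = loss_fake                            # generator objective
    balance = gamma * loss_real - loss_fake       # equilibrium error
    k_next = min(max(k_t + lambda_k * balance, 0.0), 1.0)  # keep k in [0, 1]
    convergence = loss_real + abs(balance)        # global convergence measure
    return d_loss, g_loss, k_next, convergence


loss_real = l1_reconstruction_loss([0.0, 1.0], [0.1, 0.9])  # real pair reconstruction
loss_fake = l1_reconstruction_loss([0.0, 1.0], [0.5, 0.5])  # fake pair reconstruction
print(began_step(loss_real, loss_fake, k_t=0.0))
```

Because both losses are reconstruction errors of the same auto-encoder, the discriminator output has the same image format as its target, which is consistent with an auto-encoder discriminator.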

The discriminator 420 may receive either a fake pair 405, in which the prediction image 411 is matched with the transmission video 401, or a real pair 406, in which the verification information 413 is matched with the transmission video 401, and may output a fake feature map 710 representing a fake image or a real feature map 720 representing a real image. In this case, the fake pair 405 and the real pair 406 may be in a 128x128x6 format, and the fake feature map 710 and the real feature map 720 may be RGB in a 128x128x3 format.

In addition, as shown in FIG. 7, the discriminant model may be formed with the same block structure as the encoder and decoder of the generator 410, and its encoder and decoder may be connected by a convolution layer having a 1x1 kernel size.

FIG. 8 is an example of a control device according to an embodiment of the present invention.

As shown in FIG. 8, the control device 102 may be a robot arm 810 capable of gripping and moving an object 820. A camera 830 that photographs the object 820 may be coupled to the robot arm 810. The camera 830 may be the camera 101 shown in FIG. 1. In addition to the camera 830, the camera 101 may further include a camera that is placed at a position or on a device other than the robot arm 810 and photographs the object 820 from an angle different from that of the camera 830.

As shown in FIG. 8, the robot arm 810 may include a grip 840 for gripping the object 820 and a force sensor 850 that is installed inside the grip 840 and measures the interaction force between the grip 840 and the object 820. The grip 840 may be moved by a motor 845 toward the object 820 to grip the object 820 and apply force to it, or moved away from the object 820 to release it.

In addition, although the robot arm 810 in FIG. 8 has six joints, the number of joints included in the robot arm may vary depending on the embodiment. The robot arm 810 may also measure the angle of each joint and the grip angle of the grip 840 using built-in sensors.

FIG. 9 is an example of a test set used to train a discriminant model according to an embodiment of the present invention.

Even if the control device 102 applies the same force to the object 910, the original image may differ depending on the type of the object 910, the background, the lighting, and the placement angle of the object 910.

Accordingly, the test set used to train the discriminant model may include original test videos captured for different types of the object 910, backgrounds, lighting conditions, and placement angles of the object 910, together with the motion information and location information corresponding to each condition.

For example, the control device 102 may apply additional force to the object 910 while gripping it. When the object 910 is a paper cup 911, the deformation of the paper cup 911 may differ depending on where the control device 102 grips it. The lower part of the side of the paper cup 911 is held in shape by the bottom of the cup, so a force at or below a certain level may not deform it. The upper part of the side of the paper cup 911, by contrast, has no separate structure that fixes its shape, so it may be easily deformed even by a force at or below that level.

In addition, when the object 910 is a glass bottle 912, the glass bottle 912 may not deform even under a force large enough to deform the paper cup 911. Instead, the glass bottle 912 may move in the direction of the applied force.

When the object 910 is a sponge 913, deformation may occur even under a force too small to deform the paper cup 911, and the shape may change in various ways depending on the magnitude and direction of the force applied by the control device 102.

Accordingly, the test set may include, for each type of the object 910, the motion information of the control device 102 and an original video capturing the deformation or movement of the object 910 caused by the control device 102 operating according to that motion information.

In addition, the test set may include original videos captured with other objects in the background, as shown in Case 1 of FIG. 9, and original videos captured under different weather conditions and lighting positions, as shown in Case 3 of FIG. 9.

As described above, when the object 910 is the paper cup 911, whether deformation occurs may depend on where the control device 102 grips the paper cup 911. Accordingly, as shown in Case 2 of FIG. 9, the test set may include original videos captured for each position at which the control device 102 grips the object 910 and original videos captured for each direction from which the control device 102 grips the object 910.

FIG. 10 is a diagram illustrating a process of constructing a test set for a learning network according to an embodiment of the present invention.

At the current time (time t), the camera 101 may photograph the object to generate a test video 1010 (I^GT). The control device 102 may measure the force 1040 (S^HR_f) that it applies to the object and the angle 1050 (S^HR_g) at which it grips the object, and transmit them to the test set generating device.

In this case, the test set generating device may receive a video 1020 of the previous time (time t-1). For example, the video 1020 of the previous time (time t-1) may be a test transmission video I^LR of the previous time, generated by extracting some of the frames from the original test video of the previous time (time t-1).

At time t+1, after a certain time has elapsed, the test set generating device may receive the test video 1010 of the current time (time t). The test set generating device may store, as the delay time T^D 1060 serving as the time information, the difference between time t, at which it received the video 1020 of the previous time (time t-1), the force 1040 (S^HR_f) applied to the object, and the angle 1050 (S^HR_g) at which the control device 102 grips the object, and time t+1, at which it received the test video 1010 (I^GT) of the current time (time t).

The test set generating device may then generate the test set 1000 by grouping the test video 1010 (I^GT) of time t, the video 1020 of the previous time (time t-1), the force 1040 (S^HR_f) applied to the object, the angle 1050 (S^HR_g) at which the control device 102 grips the object, and the delay time T^D 1060.
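One way to picture a single entry of the test set 1000 is as a record with five fields. The sketch below is an assumption about representation only: the class name `TestSetSample`, the field names, the use of flat lists for frames and sensor readings, and the helper `build_sample` are all hypothetical.

```python
# One grouped entry of the test set (illustrative; all names are hypothetical).
from dataclasses import dataclass
from typing import List


@dataclass
class TestSetSample:
    ground_truth: List[float]    # I^GT: test video frame of the current time t
    previous_video: List[float]  # video of the previous time t-1
    force: List[float]           # S^HR_f: force applied to the object
    grip_angle: List[float]      # S^HR_g: angle at which the object is gripped
    delay: float                 # T^D: delay time used as the time information


def build_sample(ground_truth, previous_video, force, grip_angle,
                 sensor_time, video_time):
    """Group the inputs and store the arrival-time difference as the delay."""
    return TestSetSample(ground_truth, previous_video, force, grip_angle,
                         delay=video_time - sensor_time)


sample = build_sample([0.0], [0.0], [1.5], [30.0], sensor_time=10.0, video_time=11.0)
print(sample.delay)
```

Storing the delay with each sample lets the network learn how far ahead of the delayed video it has to predict.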

FIG. 11 is an example of prediction images generated for each object and the corresponding original videos according to an embodiment of the present invention.

In FIG. 11, the original video (Ground Truth) is a video with 120 frames per second, and the transmission video (I^GT) may be a video generated by extracting one frame out of every 25 frames of the original video. The delay time may be 1 second. Accordingly, the video generating apparatus 100 may generate the prediction image of the current time using the transmission video (I^GT) from one second earlier.

Case 1 of FIG. 11 is an example of the original video 1112 and the prediction image 1111 when the object is a paper cup.

Case 2 of FIG. 11 is an example of the original video 1122 and the prediction image 1121 when the object is a sponge.

Case 3 of FIG. 11 is an example of the original video 1132 and the prediction image 1131 when the object is a glass bottle.

According to FIG. 11, for frames 15, 40, 65, 90, and 115, which are not included in the transmission video, the video generating apparatus 100 may generate prediction images that are identical to the original video being captured in real time by the camera 101 or whose similarity to it is at or above a threshold. Accordingly, the output video that the video generating apparatus 100 generates from the prediction images may also be identical to the original video or have a similarity to it at or above the threshold.
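The frame sampling in this example can be made concrete with a short sketch. It assumes that sampling starts at frame 0 of a one-second, 120-frame segment; the description gives only the frame rate and the 25-frame interval, so the offset and the function names are assumptions.

```python
# Which frames of a 120-frame second are transmitted (illustrative sketch).

FPS = 120     # frames per second of the original video
STRIDE = 25   # one frame is extracted out of every 25


def transmitted_indices(num_frames=FPS, stride=STRIDE, offset=0):
    """Indices of frames kept in the transmission video (offset is an assumption)."""
    return list(range(offset, num_frames, stride))


def missing_indices(num_frames=FPS, stride=STRIDE, offset=0):
    """Indices of frames that must be predicted by the generator."""
    sent = set(transmitted_indices(num_frames, stride, offset))
    return [i for i in range(num_frames) if i not in sent]


sent = transmitted_indices()
missing = missing_indices()
print(sent, len(missing))
```

With this offset, frames 15, 40, 65, 90, and 115 all fall among the frames that are not transmitted, so each of them has to be produced as a prediction image.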

FIG. 12 is an example of a video generated according to an embodiment of the present invention, videos generated according to existing methods, and the original video.

In FIG. 12, (a) 1220 may be prediction images generated by the video generating apparatus 100, and (b) 1230 may be prediction images generated by the WithoutSensor method. Further, (c) 1240 may be prediction images generated by the auto-encoder method, (d) 1250 may be prediction images generated by the DCGAN (deep convolutional GAN) method, and (e) 1260 may be prediction images generated by the BEGAN method.

In this case, the prediction images (b) 1230 generated by the WithoutSensor method do not use the motion information and location information, so they may fail to reflect the movement of the robot arm or the deformation of the object, as shown in FIG. 12. In the prediction images (c) 1240 generated by the auto-encoder method, image quality may drop in frames where the object deforms or moves.

The prediction images (d) 1250 generated by the DCGAN method show the robot arm gripping the object, but blurring may occur in the textures and edges of the image. In addition, in (d) 1250 the movement of the robot arm may lag behind the original video (Ground Truth) 1270, and the deformation of the object caused by the pressure of the robot arm may not appear.

In the prediction images (e) 1260 generated by the BEGAN method, as shown in FIG. 12, only the shapes of the object and the robot arm are formed, and the deformation or movement of the object may not appear.

That is, as shown in FIG. 12, the prediction images (a) 1220 generated by the video generating apparatus 100 may be more similar to the original video (Ground Truth) 1270 than the prediction images generated by the other methods.

FIG. 13 is an example of RMSE, PSNR, and SSIM according to an embodiment of the present invention.

In the graphs of FIG. 13, line 1310 is an example of the RMSE (root mean squared error), PSNR (peak signal-to-noise ratio), and SSIM (structural similarity index) determined by measuring the performance of the video generating apparatus 100. Line 1320 is an example of the RMSE, PSNR, and SSIM determined by measuring the performance of the WithoutSensor method.

Line 1330 is an example of the RMSE, PSNR, and SSIM determined by measuring the performance of the auto-encoder method. Line 1340 is an example of the RMSE, PSNR, and SSIM determined by measuring the performance of the DCGAN method. Line 1350 is an example of the RMSE, PSNR, and SSIM determined by measuring the performance of the BEGAN method.

In addition, Table 1 shows an example of the RMSE, PSNR, and SSIM measured for the video generating apparatus 100 and for the other methods. In Table 1, Proposed may be the video generation method according to an embodiment of the present invention performed by the video generating apparatus 100.

According to FIG. 13 and Table 1, the video generation method according to an embodiment of the present invention can generate an output video that differs less from the original video than videos generated by the WithoutSensor method, the auto-encoder method, the DCGAN method, and the BEGAN method.
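Two of the three metrics are simple enough to sketch directly. The code below uses the standard definitions of RMSE and PSNR on flat lists of pixel values and assumes 8-bit images (peak value 255); SSIM is omitted because it needs windowed local statistics. The description does not state how its measurements were computed, so this is an illustration of the metrics, not of the evaluation procedure.

```python
# RMSE and PSNR between a predicted frame and the ground truth (illustrative sketch).
import math


def rmse(predicted, ground_truth):
    """Root mean squared error over two equally sized pixel sequences."""
    squared = sum((p - g) ** 2 for p, g in zip(predicted, ground_truth))
    return math.sqrt(squared / len(predicted))


def psnr(predicted, ground_truth, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the frames are identical."""
    error = rmse(predicted, ground_truth)
    if error == 0:
        return math.inf
    return 20 * math.log10(max_value / error)


frame_a = [0, 0, 0, 0]
frame_b = [2, 2, 2, 2]
print(rmse(frame_a, frame_b), psnr(frame_a, frame_b))
```

A lower RMSE and a higher PSNR both indicate a prediction closer to the original video, so the two metrics move in opposite directions.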

FIG. 14 is a flowchart illustrating a method of generating a high-resolution video according to an embodiment of the present invention.

In step 1410, the image receiver 110 may receive, from the camera 101, a low-resolution transmission video based on a high-resolution original video. The transmission video may be a video composed of some of the frames extracted from the high-resolution original video, which is generated by photographing the interaction between objects. The transmission video may also be a video of a previous time, received at a time delayed by at least a certain amount from the point at which the high-resolution original video was captured.

In step 1420, the motion information receiver 120 may receive, from the control device 102 involved in the interaction between the objects included in the transmission video, the motion information of the control device 102 and the location information of the objects or of the control device 102.

In step 1430, the prediction image generator 130 may use the transmission video received in step 1410 and the motion information and location information received in step 1420 to generate prediction images of the change in the interaction force between the objects in frames that are included in the original video but missing from the transmission video. In this case, the prediction image generator 130 may use the motion information and location information received in step 1420 to generate prediction images of the change in the interaction force between the objects at the current time, both for frames missing from the transmission video and for frames included in it.

In step 1440, the determination unit 140 may use the discriminant model to classify each prediction image as a real image or a fake image.

In step 1450, the video generator 150 may combine the prediction images classified as real images in step 1440 to generate a high-resolution output video corresponding to the original video.
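The flow of steps 1410 to 1450 can be summarized as a small pipeline. In the sketch below the generator and the discriminator are passed in as plain callables, and the stand-ins used in the example are toy functions; the function name `generate_output_video` and the parameter names are hypothetical, and nothing here reflects the trained models.

```python
# Steps 1410-1450 as a pipeline of callables (illustrative; names are hypothetical).

def generate_output_video(transmitted_video, motion_info, location_info,
                          predict, is_real):
    """Predict frames, keep those judged real, and return them in order."""
    # Steps 1410 and 1420: the inputs have already been received by the caller.
    predicted = predict(transmitted_video, motion_info, location_info)  # step 1430
    accepted = [image for image in predicted if is_real(image)]         # step 1440
    return accepted                                                     # step 1450


def toy_predict(video, motion, location):
    """Stand-in generator: one labelled frame per motion sample."""
    return [f"frame_{i}" for i in range(len(motion))]


def toy_is_real(image):
    """Stand-in discriminator: rejects one frame to show the filtering."""
    return image != "frame_1"


output = generate_output_video(["lr_frame"], [0.1, 0.2, 0.3], [0.0],
                               toy_predict, toy_is_real)
print(output)
```

Keeping the discriminator as a separate callable mirrors the description, in which only prediction images judged to be real are combined into the output video.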

FIG. 15 is a flowchart illustrating a method of generating a video of the current time according to an embodiment of the present invention.

In step 1510, the image receiver 110 may receive a video of a previous time from the camera 101 because of the delay in the transmission process. The video of the previous time may be a low-resolution transmission video composed of some of the frames extracted from the high-resolution original video that the camera generated by photographing the interaction between objects at the previous time.

In step 1520, the motion information receiver 120 may receive, from the control device 102, the current motion information of the control device 102 and the current location information of the objects or of the control device 102.

In step 1530, the prediction image generator 130 may use the video of the previous time, the current motion information, and the current location information to generate prediction images of the change in the interaction force between the objects at the current time. In this case, the prediction image generator 130 may generate prediction images of the change in the interaction force between the objects at the current time for frames that are included in the original video but missing from the transmission video.

In step 1540, the determination unit 140 may use the discriminant model to classify each prediction image as a real image or a fake image.

In step 1550, the video generator 150 may combine the prediction images classified as real images in step 1540 to generate a video of the current time that the image receiver 110 has not yet received, and may thereby output video in real time.

The present invention can generate a high-resolution output video from a low-resolution transmission video composed of some of the frames of a high-resolution original video, using the motion information of a control device that controls an object and the location information of the objects or of the control device.

In this case, the present invention uses motion information and location information, which are much smaller than image data, to increase the accuracy of the prediction images derived from the low-resolution transmission video. This raises the similarity between the high-resolution output video and the high-resolution original video captured by the camera while minimizing the increase in the data required for video transmission.

In addition, the present invention uses a video of a previous time, received late because of the transmission process, together with motion information and location information received in real time, to generate a video of the current time that has been captured by the camera but not yet received, and can thereby output video related to the object in real time.

This application relates to an invention derived with the support listed below.

[Project identification number] SRFC-TB1703-02

[Department name] Samsung Electronics Future Technology Development Center

[Research program name] ICT Creative Project

[Research project title] Development of interaction force prediction technology for reproducing physical sensation based on video learning

[Managing organization] Samsung Electronics Co., Ltd.

[Research period] 2017. 09. 01. ~ 2020. 08. 31.

Meanwhile, the video generation apparatus or video generation method according to the present invention may be written as a program executable on a computer and may be implemented on various recording media such as magnetic storage media, optical reading media, and digital storage media.

Implementations of the various techniques described herein may be realized in digital electronic circuitry, or in computer hardware, firmware, software, or combinations thereof. Implementations may be realized as a computer program product, that is, a computer program tangibly embodied in a machine-readable storage device (a computer-readable medium), for processing by, or for controlling the operation of, a data processing apparatus such as a programmable processor, a computer, or multiple computers. A computer program, such as the computer program(s) described above, may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be processed on one computer or on multiple computers at one site, or distributed across multiple sites and interconnected by a communication network.

Processors suitable for processing a computer program include, by way of example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor receives instructions and data from a read-only memory, a random access memory, or both. Elements of a computer may include at least one processor that executes instructions and one or more memory devices that store instructions and data. Generally, a computer may also include, or be coupled to receive data from, transmit data to, or both, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include, by way of example, semiconductor memory devices; magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; ROM (Read Only Memory); RAM (Random Access Memory); flash memory; EPROM (Erasable Programmable ROM); and EEPROM (Electrically Erasable Programmable ROM). The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.

In addition, a computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media.

Although this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, either separately or in any suitable subcombination. Moreover, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excluded from the combination, and the claimed combination may be changed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various apparatus components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and apparatuses can generally be integrated together in a single software product or packaged into multiple software products.

Meanwhile, the embodiments of the present invention disclosed in this specification and the drawings merely present specific examples to aid understanding and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present invention pertains that, in addition to the embodiments disclosed herein, other modifications based on the technical idea of the present invention can be implemented.

101: camera
102: control device
110: image receiving unit
120: motion information receiving unit
130: prediction image generation unit
140: determination unit
150: video generation unit

Claims

1. A method of generating a high-resolution video, the method comprising:
receiving a low-resolution transmission video based on a high-resolution original video;
receiving motion information of a control device related to an interaction between objects included in the transmission video, and location information of the objects or the control device;
generating, using the transmission video, the motion information, and the location information, predicted images of a change in interaction force between the objects in frames that are included in the original video but missing from the transmission video; and
generating a high-resolution output video corresponding to the original video by determining real images from among the predicted images.

2. The method of claim 1,
wherein the transmission video is a video composed of some frames extracted from the high-resolution original video generated by capturing the interaction between the objects.

3. The method of claim 1,
wherein the motion information includes at least one of a force applied by the control device to the object and a grip angle of the control device with respect to the object.

4. The method of claim 1, further comprising:
comparing frames of the original video that are not included in the transmission video with the predicted images determined to be real images; and
training a discrimination model based on the comparison result.

5. The method of claim 1,
wherein the generating of the high-resolution output video comprises
determining each of the predicted images to be a real image or a fake image using a discrimination model, and generating the output video by combining the predicted images determined to be real images.

6. The method of claim 5,
wherein the real image is
a predicted image whose similarity to a frame of the original video that was not extracted into the transmission video is equal to or greater than a threshold, and
the fake image is
a predicted image whose similarity to a frame of the original video that was not extracted into the transmission video is less than the threshold.
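
The threshold rule in the preceding claim can be illustrated with a short sketch. The similarity measure is not defined in the text, so the mapping `1 / (1 + MSE)` used here is purely an assumption for illustration, as are the function names and the flat pixel-list representation of a frame.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def similarity(predicted, reference):
    """Map MSE into (0, 1], where 1.0 means identical frames.

    Assumed measure for illustration; the source does not define one.
    """
    return 1.0 / (1.0 + mse(predicted, reference))


def label_prediction(predicted, reference, threshold):
    """Label a predicted image as 'real' when similarity >= threshold."""
    if similarity(predicted, reference) >= threshold:
        return "real"
    return "fake"


def keep_real(predictions, references, threshold):
    """Keep only the predicted images that are labelled 'real'."""
    return [
        p for p, r in zip(predictions, references)
        if label_prediction(p, r, threshold) == "real"
    ]
```

A reference frame is available only during training, where the omitted original frames exist; at inference time the trained discrimination model takes the place of this direct comparison.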

7. The method of claim 5,
wherein the discrimination model:
generates test predicted images of a change in interaction force between objects by using test transmission videos, motion information of the control device related to the interaction between the objects included in the test transmission videos, and location information of the objects or the control device;
learns real feature information for discriminating a real image from among the test predicted images by using the test transmission video and verification information; and
learns fake feature information for discriminating a fake image from among the test predicted images by using the test transmission video and the predicted image.

8. The method of claim 5,
wherein the discrimination model:
generates a first feature map by inputting the motion information and the location information to an LSTM-FC (Long Short Term Memory - Fully Connected) network;
generates a second feature map by merging, with the first feature map, time information indicating a difference between the time at which the test transmission videos are captured and the time at which the test transmission videos are acquired;
generates a third feature map by encoding the test transmission videos;
generates a fourth feature map by merging the third feature map and the second feature map; and
generates the predicted images by decoding the fourth feature map.
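
The order in which the four feature maps are built and merged can be traced with the sketch below. It shows data flow only: every learned component (the LSTM-FC network, the encoder, the decoder) is replaced by a trivial untrained stand-in, and "merging" is assumed to mean concatenation, which the claim does not state. All function names are hypothetical.

```python
def lstm_fc_stub(motion_seq, location_seq):
    """Stand-in for the LSTM-FC network: averages each channel over time."""
    steps = [m + l for m, l in zip(motion_seq, location_seq)]
    return [sum(col) / len(steps) for col in zip(*steps)]


def encoder_stub(frames):
    """Stand-in for the image encoder: one mean intensity per frame."""
    return [sum(f) / len(f) for f in frames]


def decoder_stub(feature_map, num_pixels):
    """Stand-in for the decoder: broadcasts the mean feature to each pixel."""
    value = sum(feature_map) / len(feature_map)
    return [value] * num_pixels


def predict_frame(frames, motion_seq, location_seq, time_gap, num_pixels):
    """Follow the claimed order of feature-map construction and merging."""
    first = lstm_fc_stub(motion_seq, location_seq)  # motion + location
    second = first + [time_gap]                     # merge time information
    third = encoder_stub(frames)                    # encoded transmission video
    fourth = third + second                         # merge image and signal features
    return decoder_stub(fourth, num_pixels), fourth


frames = [[0.0, 1.0], [1.0, 1.0]]
motion = [[1.0], [3.0]]
location = [[0.0, 2.0], [0.0, 4.0]]
image, fourth_map = predict_frame(frames, motion, location, 0.5, num_pixels=2)
```

In a real model each stand-in would be a trained network and the merge would have to respect tensor shapes; the sketch only makes the ordering of the steps concrete.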

9. The method of claim 5,
wherein the discrimination model:
determines a mean squared error (MSE) loss function, a gradient difference loss (GDL) function, and an adversarial loss function by using the test predicted image and verification information; and
determines a loss function of a discriminator for learning the real feature information and the fake feature information by combining the MSE loss function, the GDL function, and the adversarial loss function.
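
A minimal sketch of how the three loss terms might be combined is given below. The weighted sum, the weights `w_mse`, `w_gdl`, and `w_adv`, the use of binary cross-entropy as the adversarial term, and the exponent `alpha` are assumptions; the claim states only that the three losses are combined. The gradient difference term follows the commonly used definition that compares absolute differences between neighbouring pixels in the target and in the prediction.

```python
import math


def mse_loss(pred, target):
    """Mean squared error over a 2-D image given as a list of rows."""
    h, w = len(pred), len(pred[0])
    total = sum(
        (pred[i][j] - target[i][j]) ** 2 for i in range(h) for j in range(w)
    )
    return total / (h * w)


def gdl_loss(pred, target, alpha=1):
    """Gradient difference loss over vertical and horizontal neighbours."""
    h, w = len(pred), len(pred[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            if i > 0:  # vertical neighbour
                t = abs(target[i][j] - target[i - 1][j])
                p = abs(pred[i][j] - pred[i - 1][j])
                total += abs(t - p) ** alpha
            if j > 0:  # horizontal neighbour
                t = abs(target[i][j] - target[i][j - 1])
                p = abs(pred[i][j] - pred[i][j - 1])
                total += abs(t - p) ** alpha
    return total


def adversarial_loss(prob_real, label):
    """Binary cross-entropy between a discriminator output and a 0/1 label."""
    eps = 1e-12
    p = min(max(prob_real, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def combined_loss(pred, target, prob_real, label,
                  w_mse=1.0, w_gdl=1.0, w_adv=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_mse * mse_loss(pred, target)
            + w_gdl * gdl_loss(pred, target)
            + w_adv * adversarial_loss(prob_real, label))


pred = [[0.0, 0.0], [0.0, 0.0]]
target = [[0.0, 1.0], [0.0, 0.0]]
loss = combined_loss(pred, target, prob_real=0.5, label=1)
```

The MSE term penalizes per-pixel error, the GDL term penalizes blurred edges, and the adversarial term reflects how convincingly the image passes as real, which is why the three are typically used together in video prediction.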

10. The method of claim 1,
wherein the transmission video is
a video of a previous time, received at a time delayed by at least a certain time from the time at which the high-resolution original video was captured, and
the generating of the predicted images comprises
generating, using the motion information and the location information, predicted images of a change in interaction force between the objects at the current time for the frames included in the transmission video and the frames missing from the transmission video.

11. A method of generating a real-time video, the method comprising:
receiving a video of a previous time owing to a delay in a transmission process;
receiving current motion information of a control device related to an interaction between objects included in the video, and current location information of the objects or the control device;
generating predicted images of a change in interaction force between the objects at the current time by using the video of the previous time, the current motion information, and the current location information; and
generating a video of the current time by determining real images from among the predicted images.

12. The method of claim 11,
wherein the video of the previous time is
a video that was generated by a camera capturing the interaction between the objects at the previous time and that is received at the current time after being delayed in the transmission process.

13. The method of claim 11,
wherein the video of the previous time is
a low-resolution transmission video composed of some frames extracted from a high-resolution original video generated by a camera capturing the interaction between the objects at the previous time.

14. The method of claim 13,
wherein the generating of the predicted images comprises
generating predicted images of a change in interaction force between the objects at the current time for frames that are included in the original video but missing from the transmission video.

15. A video generation apparatus comprising:
an image receiving unit configured to receive a low-resolution transmission video based on a high-resolution original video;
a motion information receiving unit configured to receive motion information of a control device related to an interaction between objects included in the transmission video, and location information of the objects or the control device;
a prediction image generation unit configured to generate, using the transmission video, the motion information, and the location information, predicted images of a change in interaction force between the objects in frames that are included in the original video but missing from the transmission video; and
a video generation unit configured to generate a high-resolution output video corresponding to the original video by determining real images from among the predicted images.