KR102280201B1 - Method and apparatus for inferring invisible image using machine learning

Info

Publication number: KR102280201B1
Application number: KR1020190145789A
Authority: KR
Inventors: 정용우
Original assignee: 주식회사 스칼라웍스
Priority date: 2018-11-23
Filing date: 2019-11-14
Publication date: 2021-07-21
Also published as: KR20200061294A

Abstract

The present invention relates to a method and apparatus for inferring a hidden image using machine learning. At least two data items from among at least one image, at least one text, and at least one feature value are input; a generative network generates at least one image inferred from the input data items according to a model of the correlation between images built through learning; a classification network determines a suitability score for each inferred image; and, based on a comparison of each suitability score with a reference score, each inferred image is output as a hidden image between the images represented by the input data items.

Description

Method and apparatus for inferring invisible image using machine learning

The present invention relates to a method and apparatus for automatically generating an image using machine learning.

Machine learning, which implements artificial intelligence in software, refers to algorithms by which a computer learns from data, finds patterns on its own, and learns to perform appropriate tasks. Machine learning is broadly divided into supervised learning, unsupervised learning, reinforcement learning, and the like.

Supervised learning refers to algorithms that learn from data for which the correct answers are given. Its purposes are classification and regression. Classification assigns data to one of a finite number of classes; it is divided into binary classification, which uses two classes, and multiclass classification, which uses three or more classes. Regression predicts continuous values from the features of the data and is used to predict continuous numbers. Combined with deep learning, regression is also used to create entirely new images or to synthesize speech (text-to-speech).

Unsupervised learning refers to algorithms that learn without being given the correct answers; clustering is a representative example. Attempts to predict situations from data with supervised and unsupervised learning continue, but supervised learning can use only data for which correct answers are given, which limits the amount of usable data. Unsupervised learning is therefore becoming increasingly important in artificial intelligence, and a representative example of such unsupervised learning is the GAN (Generative Adversarial Network).

Korean Laid-Open Patent Publication No. 10-2017-0137350, "Apparatus and method for learning object movement patterns using a neural network generative model," discloses an apparatus for learning object movement patterns using a neural network generative model. The apparatus includes a movement pattern acquisition unit that tracks feature points of moving objects in an input video, learns each movement pattern, and classifies the patterns by topic, and a per-image movement pattern learning unit that learns the movement pattern corresponding to an input image using the movement pattern information stored in the movement pattern acquisition unit.

However, this prior art takes about one day of CCTV (closed-circuit television) camera footage of a monitored area as training video, learns from it using DBN (Deep Belief Networks) and GAN (Generative Adversarial Networks), and can then identify which movement patterns appear in subsequent video of that area. It cannot provide an image of an unseen object that is hidden behind an obstacle such as a building or located in a blind spot that the CCTV camera cannot capture.

An object of the present invention is to provide a method and apparatus for inferring a hidden image, capable of providing an image of an unseen object that is hidden behind an obstacle such as a building or located in a blind spot that CCTV cannot capture. Another object is to provide a computer-readable recording medium on which a program for executing the method on a computer is recorded. The technical problems are not limited to those described above, and further technical problems may be derived from the following description.

A hidden image inference method according to one aspect of the present invention includes: inputting at least two data items from among at least one image, at least one text, and at least one feature value; generating, by a generative network, at least one image inferred from the at least two input data items according to a model of the correlation between images built through learning; determining, by a classification network, a suitability score for each generated inferred image; and outputting, based on a comparison of the suitability score of each generated inferred image with a reference score, each generated inferred image as a hidden image between the images represented by the at least two input data items.
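The four steps of the method can be sketched in Python as follows. This is only an illustrative outline, not the patented implementation: `infer_hidden_images`, `toy_generate`, and `toy_suitability` are hypothetical names, the generative and classification networks are replaced by plain callables, and the rule that a candidate is output when its score is at least the reference score is an assumption (the text only says the output is based on a comparison).

```python
from typing import Callable, List, Sequence, Tuple


def infer_hidden_images(
    inputs: Sequence[float],
    generate: Callable[[Sequence[float]], List[float]],
    score: Callable[[float, Sequence[float]], float],
    reference_score: float,
) -> List[Tuple[float, float]]:
    """Generate candidates, score each one, and keep those that pass the comparison."""
    # Step 1: at least two input data items are required.
    if len(inputs) < 2:
        raise ValueError("at least two input data items are required")
    # Step 2: the generative network proposes inferred images (stand-in callable).
    candidates = generate(inputs)
    hidden = []
    for candidate in candidates:
        # Step 3: the classification network assigns a suitability score.
        s = score(candidate, inputs)
        # Step 4: compare with the reference score (">=" is an assumed rule).
        if s >= reference_score:
            hidden.append((candidate, s))
    return hidden


# Toy stand-ins: "images" are single numbers, purely for illustration.
def toy_generate(inputs: Sequence[float]) -> List[float]:
    return [sum(inputs) / len(inputs), 100.0]  # a midpoint and an outlier


def toy_suitability(candidate: float, inputs: Sequence[float]) -> float:
    return 1.0 if min(inputs) <= candidate <= max(inputs) else 0.0


hidden = infer_hidden_images([1.0, 3.0], toy_generate, toy_suitability, reference_score=0.5)
```

In this toy run only the midpoint candidate lies between the two inputs, so it is the only one kept.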

The determining of the suitability score may determine the suitability score of each inferred image from the at least two input data items.

The determining of the suitability score may determine the suitability score according to the similarity between the at least two input data items and at least two data items, from among at least one image, at least one text, and at least one feature value, that are generated by inputting each generated inferred image into the generative network.
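One way to picture this similarity-based scoring is sketched below. The text does not name a similarity measure, so cosine similarity, the vector representation of each data item, and the averaging over pairs are all assumptions; `cosine_similarity` and `suitability_from_similarity` are hypothetical helper names.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def suitability_from_similarity(
    regenerated: Sequence[Sequence[float]],
    original: Sequence[Sequence[float]],
) -> float:
    """Mean similarity between data regenerated from a candidate and the original inputs."""
    if len(regenerated) != len(original) or len(original) == 0:
        raise ValueError("regenerated and original must be non-empty and the same length")
    sims = [cosine_similarity(r, o) for r, o in zip(regenerated, original)]
    return sum(sims) / len(sims)
```

A candidate whose regenerated data matches the original inputs closely would receive a score near 1 under this assumed measure.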

The generating of the inferred image may include: converting the at least two data items from among the at least one image, at least one text, and at least one feature value into latent vectors; calculating a latent vector located between the converted latent vectors by interpolating the converted latent vectors; and generating at least one image inferred from the at least two input data items using the converted latent vectors and the calculated latent vector.
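The interpolation step can be illustrated with a short sketch. The text says only that the latent vectors are interpolated, so linear interpolation with evenly spaced points is an assumption here, and `interpolate_latents` is a hypothetical name.

```python
from typing import List, Sequence


def interpolate_latents(
    z_a: Sequence[float], z_b: Sequence[float], steps: int
) -> List[List[float]]:
    """Return `steps` latent vectors evenly spaced strictly between z_a and z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same dimension")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    vectors = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # interpolation weight in (0, 1), endpoints excluded
        vectors.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return vectors
```

Each returned vector could then be decoded by the generative network into a candidate image lying between the two inputs.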

The generating of the inferred image may include: converting each input image into a latent vector; and inferring, using the converted latent vector and from the viewpoint of each input image, features of the correlation between the images represented by the at least two input data items.

The generating of the inferred image may include: converting each input text into a latent vector; and inferring, using the converted latent vector and from the viewpoint of each input text, features of the correlation between the images represented by the at least two input data items.

The generating of the inferred image may include: converting each input feature value into a latent vector; and inferring, using the converted latent vector and from the viewpoint of each input feature value, features of the correlation between the images represented by the at least two input data items.

The generating of the inferred image may include: converting the input at least one image, at least one text, and at least one feature value into latent vectors; inferring, using the converted latent vectors, features of the correlation between the images represented by the at least two input data items; and integrating the inferred features and generating, according to the correlation model, at least one image inferred from the integrated features, thereby generating at least one image inferred from the at least two input data items.

According to another aspect of the present invention, there is provided a computer-readable recording medium on which a program for executing the hidden image inference method on a computer is recorded.

A hidden image inference apparatus according to yet another aspect of the present invention includes: an input module that inputs at least two data items from among at least one image, at least one text, and at least one feature value; a generative network that generates at least one image inferred from the at least two input data items according to a model of the correlation between images built through learning; a classification network that determines a suitability score for each generated inferred image; and an output module that outputs, based on a comparison of the suitability score of each generated inferred image with a reference score, each generated inferred image as a hidden image between the images represented by the at least two input data items.

At least two data items from among at least one image, at least one text, and at least one feature value are input; the generative network generates at least one image inferred from the input data items according to a model of the correlation between images built through learning; the classification network determines a suitability score for each inferred image; and each inferred image is output as a hidden image between the images represented by the input data items, based on a comparison of its suitability score with a reference score. As a result, an image of an unseen object that is hidden behind an obstacle such as a building or located in a blind spot that CCTV cannot capture can be provided using the surrounding images, text, and image feature values.

In particular, even when only one input image exists, a hidden image associated with that input image can be inferred if text describing the scene of an associated image, or a feature value of such an image, is input in place of a second image. Furthermore, even when there is no input image at all, a hidden image can be inferred from text describing a scene or from image feature values.

FIG. 1 is a block diagram of a hidden image inference apparatus according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram of the classification network 30 shown in FIG. 1.
FIG. 3 is an exemplary diagram of the generative network 40 shown in FIG. 1.
FIG. 4 is a flowchart of a hidden image inference method according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram for explaining how the classification network 30 shown in FIG. 1 determines an image suitability score.
FIG. 6 is an exemplary diagram of a case in which all data input in step 1000 shown in FIG. 4 are images.
FIG. 7 is an exemplary diagram of a case in which part of the data input in step 1000 shown in FIG. 4 is a feature value.
FIG. 8 is an exemplary diagram of a case in which part of the data input in step 1000 shown in FIG. 4 is text.
FIG. 9 is an exemplary diagram of a case in which part of the data input in step 1000 shown in FIG. 4 is a feature value and text.
FIG. 10 is a diagram of the data processing flow of the generative network 40 for the example shown in FIG. 6.
FIG. 11 is a diagram of the data processing flow of the generative network 40 for the example shown in FIG. 10.
FIG. 12 is a detailed diagram of the feature value processing and text processing shown in FIG. 4.
FIG. 13 is a diagram illustrating a training example of the classification network 30 and the generative network 40 shown in FIG. 1.
FIG. 14 is a diagram illustrating an application example of the hidden image inference method shown in FIG. 4.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings.

FIG. 1 is a block diagram of a hidden image inference apparatus according to an embodiment of the present invention. Referring to FIG. 1, the hidden image inference apparatus according to this embodiment includes an input module 10, a control module 20, a classification network (discriminative network) 30, a generative network 40, a storage 50, and an output module 60. In the method of inferring a hidden image using deep learning according to an embodiment of the present invention, when the original image 1 is input in its entirety, the input module 10 outputs the input data 1 to the classification network 30 and the generative network 40; under the control of the control module 20, the classification network 30 and the generative network 40 infer and generate a hidden image 1'; and the output module 60 outputs the hidden image thus generated.

The classification network 30 and the generative network 40 shown in FIG. 1 are the neural networks corresponding, respectively, to the discriminator and the generator of a GAN (Generative Adversarial Network). The basic structure of a GAN is disclosed in the paper "Generative Adversarial Networks" (2014; Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio). In the adversarial training of a GAN, the classification network is trained first and the generative network is trained next, and these two steps are repeated in alternation.

Training of the classification network consists of two main steps. In the first step, real data is input to the classification network and the network is trained to classify that data as real. In the second step, conversely, fake data produced by the generative network is input to the classification network and the network is trained to classify that data as fake. Through this process the classification network becomes able to classify real data as real and fake data as fake. After the classification network has been trained, the generative network is trained in the direction of deceiving the trained classification network: fake data produced by the generative network is input to the classification network, and the generative network is trained to produce data similar enough to real data that the classification network classifies it as real.

As this training process is repeated, the classification network and the generative network treat each other as adversarial competitors and both improve. Eventually the generative network can produce fake data that is practically indistinguishable from real data, and the classification network can no longer tell real data from fake data. In other words, in a GAN the generative network tries to lower the probability that classification succeeds while the classification network tries to raise it, so that each network drives the other to improve.

More specifically, a GAN is trained by solving a minimax problem using the objective function V(D, G) shown in Equation 1 below.

[Equation 1]

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

Here, x ~ p_data(x) denotes data sampled from the probability distribution of the real data, and z ~ p_z(z) denotes data sampled from random noise, for which a Gaussian distribution is generally used. z is usually called a latent vector, meaning a vector in a latent space that describes the data well in reduced dimensions. D(x) is the output of the classification network, a value between 0 and 1 representing the probability that the data is real: D(x) is 1 if the data is real and 0 if it is fake. G(z) is the output of the generative network, and D(G(z)) is 1 if G(z), the data produced by the generative network, is judged to be real and 0 if it is judged to be fake.
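As a rough illustration of how the objective is evaluated on samples, the sketch below estimates V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], the standard form from the cited paper, from lists of classification-network outputs. `value_function` is a hypothetical helper, and the small `eps` clamp is an added numerical safeguard that the text does not mention.

```python
import math
from typing import Sequence


def value_function(
    d_real: Sequence[float], d_fake: Sequence[float], eps: float = 1e-12
) -> float:
    """Sample estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real holds D(x) for real samples; d_fake holds D(G(z)) for generated samples.
    """
    if len(d_real) == 0 or len(d_fake) == 0:
        raise ValueError("both sample lists must be non-empty")
    # eps keeps log() finite when an output is exactly 0 or 1 (assumed safeguard).
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

A classification network that outputs 1 for every real sample and 0 for every generated sample attains the largest possible value, 0; any uncertainty makes the estimate negative.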

Consider first how the classification network maximizes V(D, G). To maximize the value of Equation 1, both the first and the second term on the right-hand side must be maximized, so log D(x) and log(1 - D(G(z))) must both be at their maximum. D(x) must therefore be 1, which means training the classification network to classify real data as real. Likewise, 1 - D(G(z)) must be 1, so D(G(z)) must be 0, which means training the classification network to classify the fake data produced by the generative network as fake. In other words, training the classification network so that V(D, G) is maximized is the process of training it to classify real data as real and fake data as fake.

Next, consider how the generative network is trained to minimize V(D, G). The first term on the right-hand side of Equation 1 does not contain G, so it is unrelated to the generative network and can be omitted. To minimize the second term, log(1 - D(G(z))) must be made as small as possible. Accordingly, 1 - D(G(z)) must approach 0, that is, D(G(z)) must approach 1. This means training the generative network to produce fake data convincing enough that the classification network classifies it as real. Training the classification network in the direction that maximizes V(D, G) and the generative network in the direction that minimizes V(D, G) in this way is called a minimax problem.
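To see how the second term of the objective behaves from the generative network's side, the short sketch below evaluates log(1 - D(G(z))) for a few classification-network outputs. `generator_term` is a hypothetical helper and the sample values are arbitrary.

```python
import math


def generator_term(d_fake: float) -> float:
    """Second term of V(D, G) for one generated sample: log(1 - D(G(z)))."""
    if not 0.0 <= d_fake < 1.0:
        raise ValueError("D(G(z)) must lie in [0, 1) for the logarithm to be finite")
    return math.log(1.0 - d_fake)


# The term falls steadily as the classification network is fooled more often.
for d_fake in (0.1, 0.5, 0.9, 0.99):
    print(f"D(G(z)) = {d_fake:<4}  log(1 - D(G(z))) = {generator_term(d_fake):.3f}")
```

The term equals 0 when D(G(z)) is 0 and decreases without bound as D(G(z)) approaches 1, so minimizing it pushes the generative network toward samples that the classification network scores as real.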

To keep the description of the embodiments from becoming unduly long, descriptions of techniques well known in the art to which this embodiment belongs, such as the basic structure of the classification network 30 and the generative network 40 described above, are omitted. The following description focuses on the features of this embodiment that distinguish it from the prior art.

FIG. 2 is an exemplary diagram of the classification network 30 shown in FIG. 1, and FIG. 3 is an exemplary diagram of the generative network 40 shown in FIG. 1. As shown in FIGS. 2 and 3, the classification network 30 and the generative network 40 may be implemented as CNN (Convolutional Neural Network)-based deep learning networks of similar structure. As described below, the generative network 40 generates a hidden image inferred from at least two data items from among at least one image, at least one text, and at least one feature value, and the classification network 30 outputs a suitability score for the image generated by the generative network 40.

FIG. 4 is a flowchart of a hidden image inference method according to an embodiment of the present invention. Referring to FIG. 4, the hidden image inference method according to this embodiment consists of the following steps, which are executed in time sequence in the hidden image inference apparatus shown in FIG. 1.

In step 1000, the input module 10, under the control of the control module 20, inputs at least two data items from among at least one image, at least one text, and at least one feature value to each of the classification network 30 and the generative network 40. These data are input through the input module 10 in response to a user operation such as a data upload. Here, an image input to the generative network 40 means the data of an image obtained by the user. For example, when a user can photograph an object with a camera from only some angles, the image input to the generative network 40 may be the data of an image obtained by photographing the object from those angles. In this case, the hidden image is the image expected to be obtained if the object were photographed from an angle at which photographing is not possible.

Text input to the generative network 40 means text describing the scene represented by an image. A feature value input to the generative network 40 means a value representing the features of an image. In this embodiment, one text or one feature value takes the place of one image. According to this embodiment, even when the user can obtain only one image of an object, or cannot obtain any image, the user can still obtain a hidden image by inputting text or feature values to the generative network 40 through the input module 10 in place of images.

When several images are spatially and temporally correlated with one another with respect to some event, the correlation between those images can be inferred from at least two data items from among at least one image, at least one text, and at least one feature value. For example, from two images of an object taken at two angles, and with respect to the event of photographing the object in a space successively over 360 degrees, it is possible to infer the 360-degree spatial correlation centered on the object and the temporal correlation of successively photographing changes in the object.

The generative network 40 can infer the correlation between images in each of the following cases: when the input module 10 inputs a plurality of images to the generative network 40; when it inputs a plurality of texts; when it inputs a plurality of feature values; when it inputs at least one image and at least one text; when it inputs at least one image and at least one feature value; and when it inputs at least one text and at least one feature value.
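The input combinations above all amount to supplying at least two data items in total across the three data types. A minimal sketch of such a check follows; the `InferenceInputs` container and its field names are hypothetical and are not part of the described apparatus.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InferenceInputs:
    """Hypothetical container for the data handed to the input module."""

    images: List[object] = field(default_factory=list)
    texts: List[str] = field(default_factory=list)
    feature_values: List[List[float]] = field(default_factory=list)

    def count(self) -> int:
        """Total number of data items across all three types."""
        return len(self.images) + len(self.texts) + len(self.feature_values)

    def is_valid(self) -> bool:
        """True when at least two data items are present, in any combination."""
        return self.count() >= 2
```

For instance, one image plus one text passes the check, as do two feature values alone, while a single text on its own does not.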

In step 2000, if at least one image was input in step 1000, the generative network 40 converts each input image, through an embedding layer for that image, into a stacked image in a format that the convolutional layers can process, and converts the stacked image into a latent vector representing it. The generative network 40 then uses the converted latent vector, through a plurality of convolutional layers, to infer, from the viewpoint of each input image, features of the correlation between the images represented by the at least two data items input in step 1000. When a suitability score for an inferred image is fed back from the classification network 30 to the generative network 40, the generative network 40 updates the values of the latent vector in the direction that improves the suitability score of that inferred image. Through this update, the latent vector comes to better represent the features of the corresponding image within the correlation between images. The generative network 40 is trained through this process, and the more it is trained, the better the latent vector represents those features.

In step 2000, if at least one text was input in step 1000, the generation network 40 passes each input text through that text's embedding layer, converting it into a latent vector that represents it. The generation network 40 then uses this latent vector, through a text processing layer, to infer, from the viewpoint of each input text, the features of the association between the images represented by the at least two pieces of data input in step 1000. When a suitability score for an inferred image is fed back from the classification network 30 to the generation network 40, the generation network 40 updates the values of the latent vector in the direction that improves that score. Through this update, the latent vector comes to better represent the features of the text within the association between images. The generation network 40 is trained through this process, and the more it is trained, the better the latent vector represents those features.

In step 2000, if at least one feature value was input in step 1000, the generation network 40 passes each input feature value through that feature value's embedding layer, converting it into a latent vector that represents it. The generation network 40 then uses this latent vector, through a feature value processing layer, to infer, from the viewpoint of each input feature value, the features of the association between the images represented by the at least two pieces of data input in step 1000. When a suitability score for an inferred image is fed back from the classification network 30 to the generation network 40, the generation network 40 updates the values of the latent vector in the direction that improves that score. Through this update, the latent vector comes to better represent the features of the feature value within the association between images. The generation network 40 is trained through this process, and the more it is trained, the better the latent vector represents those features.

According to this embodiment, as training of the generation network 40 proceeds, the adversarial competition between the generation network 40 and the classification network 30 makes the latent vectors represent the features of the corresponding image, text, or feature value within the association between images increasingly well, so the accuracy of the hidden images provided by this embodiment steadily increases.

As described above, the generation network 40 converts the at least two pieces of data input in step 1000 into latent vectors. The generation network 40 may also interpolate these latent vectors to compute at least one latent vector located between them, and then use both the converted latent vectors and the computed latent vector or vectors to generate at least one image inferred from the at least two pieces of data input in step 1000. Because the latent vectors converted in step 2000 correspond to images acquired by the user, a latent vector computed by interpolating them is likely to correspond to a hidden image lying between the acquired images.

In other words, the latent vectors computed by interpolating the latent vectors converted in step 2000 act as a larger number of input images close to the hidden image, which raises the accuracy of the association features inferred from them. As a result, the simple operation of interpolating the latent vectors converted in step 2000 can greatly improve the accuracy of the hidden image output from the generation network 40.
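The interpolation step described above can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the interpolation scheme, so linear interpolation is assumed here, and the function name `interpolate_latents` and the sample vectors are hypothetical.

```python
from typing import List

Vector = List[float]


def interpolate_latents(z_a: Vector, z_b: Vector, count: int) -> List[Vector]:
    """Return `count` vectors evenly spaced strictly between z_a and z_b.

    Assumption: linear interpolation; the source text only says that
    latent vectors located between the converted ones are computed.
    """
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same size K")
    if count < 1:
        raise ValueError("count must be at least 1")
    result = []
    for i in range(1, count + 1):
        t = i / (count + 1)  # 0 < t < 1, so the two endpoints are excluded
        result.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return result


# Hypothetical latent vectors of size K = 3 for two acquired images.
z_first = [0.0, 2.0, -1.0]
z_second = [1.0, 0.0, 1.0]
between = interpolate_latents(z_first, z_second, 1)
print(between)  # [[0.5, 1.0, 0.0]]
```

Each returned vector would then be fed to the generator alongside the original latent vectors, playing the role of an additional input located near the hidden image.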

In step 3100, the generation network 40 integrates the plurality of features inferred in step 2000 through a fully connected layer and, according to the model of associations between images built through its training, generates images inferred from the integrated features, thereby generating images inferred from the at least two pieces of data input in step 1000. Early in training, when several images (or texts or feature values in place of images) are input, the generation network 40 has very little ability to understand the associations among them. As the amount of training increases under the adversarial competition between the generation network 40 and the classification network 30, both networks can build neural network models that understand these associations more accurately.

In step 3200, the generation network 40 upsamples the inferred images generated in step 3100 to produce inferred images at the resolution required by the user. In step 3300, the generation network 40 inputs each inferred image generated in step 3200 to the classification network 30. In step 3400, under the control of the control module 20, the classification network 30 determines, with respect to the at least two pieces of data input in step 1000, a suitability score for each inferred image generated in step 3200. Step 3400 is described in detail below.

In step 3500, the control module 20 compares the suitability score of each inferred image determined in step 3400 with a preset reference score. If the suitability score of an inferred image is below the reference score, the process proceeds to step 3600; if it is equal to or above the reference score, the process proceeds to step 3700.

In step 3600, the control module 20 inputs the suitability score of each inferred image to the generation network 40, thereby feeding the score back from the classification network 30 to the generation network 40. The reference score can be set by the user. A higher reference score can improve the accuracy of each hidden image but may reduce the number of hidden images provided by this embodiment. Conversely, a lower reference score may increase the number of hidden images provided but reduce the accuracy of each one. According to this embodiment, the user can therefore obtain hidden images of the desired quality and number by adjusting the reference score.

In step 3700, the control module 20 separates out, from the inferred images of step 3300, those whose suitability score is equal to or greater than the reference score, and stores them in the storage 50. Under the control of the control module 20, the output module 60 outputs each inferred image stored in the storage 50 as a hidden image between the images represented by the at least two pieces of data input in step 1000. The hidden image is provided to the user through the output module 60, such as a display panel, in response to, for example, an output command from the user.
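The branching of steps 3500 to 3700 can be sketched as a simple threshold filter. This is an illustrative sketch rather than the patented implementation: the function name `route_by_reference_score`, the image identifiers, and the score values are all hypothetical.

```python
from typing import Dict, List, Tuple


def route_by_reference_score(
    scores: Dict[str, float], reference_score: float
) -> Tuple[List[str], List[str]]:
    """Split inferred images into accepted ones and ones to feed back."""
    accepted: List[str] = []
    fed_back: List[str] = []
    for image_id, score in scores.items():
        if score >= reference_score:
            accepted.append(image_id)  # step 3700: store and output as hidden image
        else:
            fed_back.append(image_id)  # step 3600: feed the score back to the generator
    return accepted, fed_back


# Hypothetical suitability scores for three inferred images.
scores = {"x1": 0.91, "x2": 0.40, "x3": 0.75}
accepted, fed_back = route_by_reference_score(scores, 0.75)
print(accepted, fed_back)  # ['x1', 'x3'] ['x2']

# Raising the reference score yields fewer accepted hidden images.
strict_accepted, _ = route_by_reference_score(scores, 0.90)
print(strict_accepted)  # ['x1']
```

The second call illustrates the trade-off stated in the text: a higher reference score keeps only the better-scoring images, at the cost of returning fewer of them.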

FIG. 5 is a conceptual diagram illustrating how the classification network 30 shown in FIG. 1 determines the image suitability score. FIG. 5(a) shows how a typical GAN determines cycle consistency loss, and FIG. 5(b) shows how this embodiment determines the image suitability score.

Referring to FIG. 5(a), the cycle consistency loss of a typical GAN judges how accurately the generation network converted an original image X into an image Y by converting the fake image Y back and checking how closely the resulting image X' matches the original image X. A conventional GAN is thus limited to one input image and one output image, whereas this embodiment can produce N image outputs for M inputs. This has the advantage of allowing diverse outputs from a limited input.

Referring to FIG. 5(b), the inputs matched to the N inferred images X' may be M images X, feature values Xv, and texts Xt. They may consist of M images X only, M feature values Xv only, or M texts Xt only. They may also be M images X together with feature values Xv, M images X together with texts Xt, or M feature values Xv together with texts Xt. In FIG. 5(b), the dotted lines indicate matching under the usual cycle consistency loss method, and the solid lines indicate matching under the image suitability score method of this embodiment.

In step 3400, the control module 20 inputs each inferred image from step 3300 to the generation network 40 and obtains, from the output of the generation network 40, at least two pieces of data among at least one image, at least one text, and at least one feature value. The classification network 30 then determines the suitability score of each inferred image generated in step 3200 according to the similarity between the at least two pieces of data obtained through this inverse transformation and the at least two pieces of data input in step 1000. Therefore, even in an environment where it is difficult to obtain enough input images to infer a hidden image, a hidden image can be inferred with high accuracy by using text that describes an image or values that represent the features of an image.

To determine the suitability score, the classification network 30 applies, to each of the at least two pieces of data obtained through the inverse transformation and each of the at least two pieces of data input in step 1000, the same latent vector conversion and association feature inference that the generation network 40 performs in step 2000, so a detailed description is omitted. The difference is that, whereas in step 3100 the generation network 40 generates inferred images through its fully connected layer, the classification network 30 uses its fully connected layer to integrate the features inferred for each of the inversely transformed data and each of the data input in step 1000, and determines the suitability score of each inferred image generated in step 3200 according to the similarity indicated by the integrated features.
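The idea of scoring an inferred image by how closely the inversely transformed data match the original inputs can be sketched as below. Note the assumptions: the patent does not name a similarity measure, so mean cosine similarity over feature vectors, clipped to the range 0 to 1, is used here purely for illustration, and the function names and sample vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def suitability_score(
    original_features: List[Sequence[float]],
    recovered_features: List[Sequence[float]],
) -> float:
    """Average similarity between each original input and its recovered counterpart.

    Assumption: a score near 1 means the recovered data closely match the inputs.
    """
    if not original_features or len(original_features) != len(recovered_features):
        raise ValueError("need the same non-zero number of original and recovered items")
    sims = [cosine_similarity(o, r) for o, r in zip(original_features, recovered_features)]
    mean = sum(sims) / len(sims)
    return max(0.0, min(1.0, mean))  # clip to [0, 1]


# Hypothetical feature vectors for two inputs (for example, one image and one text).
originals = [[1.0, 0.0], [0.0, 2.0]]
perfect = suitability_score(originals, [[1.0, 0.0], [0.0, 2.0]])
unrelated = suitability_score(originals, [[0.0, 1.0], [3.0, 0.0]])
```

With these sample vectors, a perfect reconstruction scores approximately 1, while a reconstruction orthogonal to the inputs scores 0.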

According to this embodiment, the numbers of input and output images can be set freely by the user. Because several hidden images can be output from a single input image, a scene in a sports game that is difficult to shoot with a drone or camera can, for example, be rendered as a 360-degree video from images of a particular scene. The generation network 40 and the classification network 30 of this embodiment can support full HD images with a 16:9 width-to-height ratio, as used by real CCTV. Conventional GANs supported only a 1:1 width-to-height ratio, which limited their practical usefulness.

FIG. 6 illustrates the case in which all data input in step 1000 of FIG. 4 are images. Consider the following situation as an example. The first image (51') shows a person standing on the street. The second image (52') shows that the person has just come out of a house. The third image (53') shows that the person is simply passing by. The fourth image (54') shows that the next destination is the house on the opposite side. The fifth image (55') shows the person walking while wearing headphones. The sixth image (56') shows the person taking sunglasses out of a pocket. The seventh image (57') shows the person wearing the sunglasses. The eighth image (58') shows the person walking while wearing both sunglasses and headphones.

The classification network 30 and the generation network 40 may crop, from the eight images (51', 52', 53', 54', 55', 56', 57', 58') input from the input module 10, only the portions that require conversion and use them for their respective training. The eight cropped images (51, 52, 53, 54, 55, 56, 57, 58) may be 1920×1080×3 full HD images. In this embodiment, instead of such an image, text describing the scene that the image shows, or values representing the features of the image, may be input.

FIG. 7 illustrates the case in which some of the data input in step 1000 of FIG. 4 are feature values. Referring to FIG. 7, the first type (T1), shown in dark shading, marks the portions where, instead of an image input, values representing features of the image are supplied, such as a coordinate value, a movement speed value, a relative position value, a clothing type code, and a clothing color code of a subject such as the person being photographed. The second type (T2), shown in light shading, marks the portions where an image input exists.

For example, suppose that L=2 of M=8 images are missing. Because image 52' is not input, image 52 may be replaced with a feature value consisting of 0.32 (coordinate), 0.42 (movement speed), 0.25 (relative position), and 0.211 (clothing type code). Because image 55' is not input, image 55 may be replaced with a feature value consisting of 0.73 (coordinate), 0.48 (movement speed), 0.05 (relative position), and 0.2225 (clothing type code). The number L of feature values replacing images is therefore 2. The input module 10 inputs M-L = 8-2 = 6 images (51, 53, 54, 56, 57, 58) and L = 2 feature values (52, 55) to each of the classification network 30 and the generation network 40. When the size of the latent vector z is K, M latent vectors of size K are allocated to each of the classification network 30 and the generation network 40 for this input.

FIG. 8 illustrates the case in which some of the data input in step 1000 of FIG. 4 are text. Referring to FIG. 8, the first type (T1), shown in dark shading, marks the portions where, instead of an image input, text describing the scene of the image is supplied. The second type (T2), shown in light shading, marks the portions where an image input exists.

For example, suppose that L=3 of M=8 images are missing. Because image 52' is not input, image 52 may be replaced with the text "He was passing by the red brick house and..."; because image 55' is not input, image 55 may be replaced with the text "the camera shows his right side of his face..."; and because image 57' is not input, image 57 may be replaced with the text "He wore the sunglass and his top left side..". The number L of texts replacing images is therefore 3. The input module 10 inputs M-L = 8-3 = 5 images (51, 53, 54, 56, 58) and L = 3 texts (52, 55, 57) to each of the classification network 30 and the generation network 40. When the size of the latent vector z is K, M latent vectors of size K are allocated to each of the classification network 30 and the generation network 40 for this input.

FIG. 9 illustrates the case in which some of the data input in step 1000 of FIG. 4 are feature values and text. Referring to FIG. 9, the first type (T1), shown in dark shading, marks the portions where, instead of an image input, values representing features of the image or text describing the scene of the image are supplied. The second type (T2), shown in light shading, marks the portions where an image input exists.

For example, suppose that L=4 of M=8 images are missing. Because image 52' is not input, image 52 may be replaced with the text "He was passing by the red brick house and ..."; because image 55' is not input, image 55 may be replaced with the text "the camera shows his right side of his face..."; because image 57' is not input, image 57 may be replaced with the text "He wore the sunglass and his top left side.."; and because image 53' is not input, image 53 may be replaced with a feature value consisting of 0.93 (coordinate), 0.42 (movement speed), 0.15 (relative position), and 0.605 (clothing type code). The number L of data items replacing images is therefore 4. The input module 10 inputs M-L = 8-4 = 4 images (51, 54, 56, 58) and L = 4 data items (52, 53, 55, 57) to each of the classification network 30 and the generation network 40. When the size of the latent vector z is K, M latent vectors of size K are allocated to each of the classification network 30 and the generation network 40 for this input.
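The bookkeeping in the FIG. 9 example (M slots, of which L are replaced by text or feature values, and M latent vectors of size K) can be sketched as follows. The slot representation, the function names, the choice K = 16, and the zero-filled placeholder vectors are assumptions made for illustration; real embedding layers would produce the latent values.

```python
from typing import Any, List, Tuple

Slot = Tuple[str, Any]  # (kind, payload); kind is "image", "text", or "feature"


def count_replaced(slots: List[Slot]) -> int:
    """L: the number of slots where an image is replaced by text or feature values."""
    return sum(1 for kind, _ in slots if kind != "image")


def allocate_latents(slots: List[Slot], k: int) -> List[List[float]]:
    """Allocate one size-K latent vector per slot, whatever its kind.

    Placeholder: zeros stand in for the output of each embedding layer.
    """
    return [[0.0] * k for _ in slots]


# The eight slots of the FIG. 9 example, in order 51 to 58.
slots: List[Slot] = [
    ("image", "51"),
    ("text", "He was passing by the red brick house and ..."),
    ("feature", [0.93, 0.42, 0.15, 0.605]),
    ("image", "54"),
    ("text", "the camera shows his right side of his face..."),
    ("image", "56"),
    ("text", "He wore the sunglass and his top left side.."),
    ("image", "58"),
]

M = len(slots)
L = count_replaced(slots)
latents = allocate_latents(slots, k=16)  # K = 16 is an arbitrary illustrative size
print(M, L, M - L, len(latents))  # 8 4 4 8
```

The point of the sketch is that every slot, whether image, text, or feature value, ends up with a latent vector of the same size K, which is what allows the different modalities to be merged later.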

FIG. 10 is a diagram of the data processing of the generation network 40 for the example shown in FIG. 6. FIG. 10 shows the case in which all data input in step 1000 of FIG. 4 are images: the images are gathered into a stacked image, which is then reconstructed into the output images of step 3300. Each image consists of three channel images: an R (red) channel, a G (green) channel, and a B (blue) channel. For example, the generation network 40 can reconstruct M×3 channel images of 1920×1080 full HD size into N×3 channel images of 1920×1080 full HD size.

FIG. 11 is a diagram of the data processing of the generation network 40 for the example shown in FIG. 9. FIG. 11 shows the case in which the data input in step 1000 of FIG. 4 are a combination of at least one image, at least one text, and at least one feature value, and these are reconstructed into the output images of step 3300. For example, the generation network 40 can reconstruct (M-L)×3 channel images of 1920×1080 full HD size into N×3 channel images of 1920×1080 full HD size. In this case, there are L latent vectors of size K converted from the at least one text and the at least one feature value. As shown in FIG. 11, (M-L)×3 = (8-4)×3 = 4×3 channel images can be reconstructed into N×3 = 12×3 channel images.
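The channel arithmetic above can be checked with a short calculation. The helper names and the (height, width, channels) layout of the stacked image are assumptions for illustration; the source only gives the counts M, L, N and the 1920×1080 full HD size.

```python
from typing import Tuple


def channel_image_count(num_images: int, channels: int = 3) -> int:
    """Number of single-channel images when each image has R, G, and B channels."""
    return num_images * channels


def stacked_shape(
    num_images: int, height: int = 1080, width: int = 1920, channels: int = 3
) -> Tuple[int, int, int]:
    """Shape of a stacked image, assuming a (height, width, channels) layout."""
    return (height, width, num_images * channels)


M, L, N = 8, 4, 12  # values from the FIG. 11 example
input_channels = channel_image_count(M - L)   # (8-4) x 3
output_channels = channel_image_count(N)      # 12 x 3
print(input_channels, output_channels)        # 12 36
print(stacked_shape(M - L))                   # (1080, 1920, 12)
```

In the all-image case of FIG. 10, the same helpers apply with L = 0, giving M×3 input channel images.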

FIG. 12 is a detailed diagram of the feature value processing and text processing shown in FIG. 4. Referring to FIG. 12, the text processing shown in FIG. 4 may be performed through a plurality of convolution layers, a fully connected layer, a dropout layer, a flatten layer, and a reshape layer. The convolution layers extract the features of the text using the latent vector corresponding to the text, thereby generating a three-dimensional feature map. The fully connected layer integrates the text features represented by the three-dimensional feature map so that the feature map represents the features of the association between images from the viewpoint of the text. The dropout layer, in order to reduce the size of the three-dimensional feature map, drops those values making up the inferred features that are at or below a specific value. The flatten layer converts the three-dimensional feature map into one-dimensional feature data representing the features of the association between images from the viewpoint of the text. The reshape layer, to prepare for data merging in the next step, converts the one-dimensional feature data into feature data of size K representing the features of the association between images from the viewpoint of the text.

The feature value processing shown in FIG. 4 may be performed through a plurality of multilayer perceptron layers, a flatten layer, and a reshape layer. The multilayer perceptron layers use the latent vector corresponding to the feature value to infer the features of the association between images from the viewpoint of the feature value, thereby generating a three-dimensional feature map representing those features. The flatten layer converts the three-dimensional feature map into one-dimensional feature data representing the features of the association between images from the viewpoint of the feature value. The reshape layer, to prepare for data merging in the next step, converts the one-dimensional feature data into feature data of size K representing the features of the association between images from the viewpoint of the feature value.
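The tail of the processing chain described above (dropping small values, flattening, and reshaping to size K) can be sketched in plain Python. Several assumptions apply: dropped values are zeroed rather than removed, the feature map is a small nested list, and the reshape step only accepts data that already contain exactly K elements. The function names and sample values are hypothetical. Note also that the threshold-based dropping described in the source differs from the random dropout used in most deep learning libraries.

```python
from typing import List

FeatureMap3D = List[List[List[float]]]


def threshold_drop(feature_map: FeatureMap3D, threshold: float) -> FeatureMap3D:
    """Zero out every value at or below `threshold` (assumed meaning of 'drop')."""
    return [
        [[v if v > threshold else 0.0 for v in row] for row in plane]
        for plane in feature_map
    ]


def flatten(feature_map: FeatureMap3D) -> List[float]:
    """Convert the 3D feature map into 1D feature data."""
    return [v for plane in feature_map for row in plane for v in row]


def reshape_to_k(flat: List[float], k: int) -> List[float]:
    """Return size-K feature data; assumes the flattened data already hold K values."""
    if len(flat) != k:
        raise ValueError(f"expected {k} values, got {len(flat)}")
    return list(flat)


# Hypothetical 2 x 2 x 2 feature map, so K = 8.
feature_map = [
    [[0.1, 0.9], [0.4, 0.05]],
    [[0.7, 0.2], [0.3, 0.8]],
]
kept = threshold_drop(feature_map, 0.25)
flat = flatten(kept)
features_k = reshape_to_k(flat, 8)
print(features_k)  # [0.0, 0.9, 0.4, 0.0, 0.7, 0.0, 0.3, 0.8]
```

Producing the same size K for text and feature value branches is what makes the subsequent merge with the image branch straightforward.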

FIG. 13 shows an example of training the classification network 30 and the generation network 40 shown in FIG. 1. FIG. 13 shows face images captured continuously by a CCTV camera rotating around a person. Some of these face images are assumed to have been captured by the CCTV, and the rest are assumed to be images that the CCTV cannot capture. The images assumed to have been captured by the CCTV are input to the generation network 40, which outputs first inferred images in response. The images assumed to be impossible to capture by the CCTV are also input to the generation network 40, which outputs second inferred images in response. Here, the first inferred images correspond to the fake images of a GAN, and the second inferred images correspond to the inversely transformed images.

The first and second inferred images output from the generation network 40 are input to the classification network 30, which outputs a suitability score for each of the first inferred images in response. The classification network 30 is trained to output a value close to 1 according to the similarity between the first inferred images output from the generation network 40 and the images assumed to be impossible to capture by the CCTV. A suitability score of 1 from the classification network 30 means that the first inferred images match the images assumed to be impossible to capture by the CCTV.

The generation network 40 updates the values of the latent vectors, which represent the correlation between images, in the direction that improves the suitability score output from the classification network 30, and uses the updated latent vectors to generate images inferred from the images assumed to be captured by the CCTV. Through this update, each latent vector comes to express the correlation between images more fully. The inferred images generated in this way are input to the classification network 30 again, and the classification network 30 outputs a suitability score for each inferred image. As this process repeats, the generation network 40 generates inferred images that are increasingly close to the images assumed to be uncapturable by the CCTV, and the classification network 30 becomes increasingly good at judging the similarity between those images and the inferred images.
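The repeated latent-vector update can be illustrated with a toy example. This is only a sketch under strong assumptions: the generator is reduced to a fixed element-wise linear map, and the mean squared error against the held-out image stands in for the classification network's feedback (a lower error corresponds to a higher suitability score). The names `generate`, `mean_squared_error`, `update_latent`, and `weights` are hypothetical and do not appear in the description.

```python
def generate(latent, weights):
    """Toy generator (assumption): each pixel is weight * latent component."""
    return [w * z for w, z in zip(weights, latent)]


def mean_squared_error(image, target):
    """Proxy for the feedback signal: lower error means higher suitability."""
    return sum((p - t) ** 2 for p, t in zip(image, target)) / len(target)


def update_latent(latent, weights, target, learning_rate=0.5):
    """One gradient step on the latent vector that reduces the error."""
    image = generate(latent, weights)
    n = len(target)
    # d(MSE)/dz_i = 2 * (w_i * z_i - t_i) * w_i / n
    gradients = [2.0 * (p - t) * w / n for p, t, w in zip(image, target, weights)]
    return [z - learning_rate * g for z, g in zip(latent, gradients)]


weights = [1.0, 0.5]   # fixed toy generator parameters
target = [0.8, 0.2]    # image assumed to be uncapturable (held out)
latent = [0.0, 0.0]    # initial latent vector

initial_error = mean_squared_error(generate(latent, weights), target)
for _ in range(50):
    latent = update_latent(latent, weights, target)
final_error = mean_squared_error(generate(latent, weights), target)
```

In a real system the generator parameters would also be trained and the gradient would flow through the classification network; the sketch isolates only the idea of moving the latent vector in the direction that improves the score.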

The example above trains the classification network 30 and the generation network 40 using images only, but the text or feature values described above may be used to train the two networks in place of images. In particular, according to the present embodiment, the suitability score output from the classification network 30 is fed back to the generation network 40 only when it is below the reference score, which prevents the competitive learning between the classification network 30 and the generation network 40 from repeating regardless of the hidden-image quality the user requires. That is, compared with a conventional GAN, the present embodiment can efficiently provide hidden images of the quality the user requires while reducing the data processing load on the hardware, such as a computer, on which the embodiment is implemented.
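The gating of the feedback on the reference score can be sketched as a loop that stops as soon as the score is acceptable. The function `train_until_acceptable`, the scalar `state`, and the toy `score` and `step` functions below are hypothetical placeholders for the two networks, and the `max_rounds` safety cap is an added assumption rather than part of the described method.

```python
def train_until_acceptable(score_fn, step_fn, state, reference_score, max_rounds=1000):
    """Feed the score back only while it is below the reference score.

    score_fn stands in for the classification network and step_fn for one
    update of the generation network. Returns the final state, its score,
    and the number of feedback rounds that were actually run.
    """
    rounds = 0
    score = score_fn(state)
    while score < reference_score and rounds < max_rounds:
        state = step_fn(state)   # feedback: generator improves its output
        score = score_fn(state)  # classifier re-scores the new output
        rounds += 1
    return state, score, rounds


def score(x):
    """Toy score: 1.0 when x equals the ideal value 1.0, lower otherwise."""
    return 1.0 - abs(1.0 - x)


def step(x):
    """Toy update: move halfway toward the ideal value."""
    return x + 0.5 * (1.0 - x)


state, final_score, rounds = train_until_acceptable(score, step, 0.0, reference_score=0.9)
# A result that already meets the reference score triggers no feedback at all.
_, _, rounds_when_good = train_until_acceptable(score, step, 0.5, reference_score=0.4)
```

With a reference score of 0.9 the toy loop stops after four rounds instead of continuing toward a perfect score, which is the saving in computation that the paragraph above attributes to the reference score.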

FIG. 14 is a diagram illustrating an example application of the hidden image inference method shown in FIG. 4. In FIG. 14, twelve output images, as would be obtained by photographing a player through 360 degrees around that player, are inferred from a single input image of the player during a soccer match. According to the present embodiment, even when only one input image exists, a hidden image associated with that input image can be inferred by inputting, in place of further images, text describing the scene of an associated image or feature values of such an image. Furthermore, even when there is no input image, a hidden image can be inferred from text describing a scene or from image feature values.

According to the present embodiment, an image of an object that cannot be seen, because it is hidden by an obstacle such as a building or lies in a blind spot that a CCTV cannot capture, can be provided using surrounding images, text, and image feature values. Even when there is only one photograph of a fugitive's face, or none, 360-degree face images of the fugitive can be provided solely from a description of the fugitive's face or from feature values. Likewise, even when there is only one photograph of an escape route, or none, images of the rest of the escape route can be provided solely from a description of the route or from feature values.

Meanwhile, the hidden image inference method according to an embodiment of the present invention described above can be written as a program executable by a computer processor, and can be implemented in a computer that records the program on a computer-readable recording medium and executes it. The computer includes any type of computer capable of executing a program, such as a desktop computer, a notebook computer, a smartphone, or an embedded computer. In addition, the data structures used in the embodiment described above can be recorded on a computer-readable recording medium by various means. Computer-readable recording media include storage media such as RAM, ROM, magnetic storage media (for example, floppy disks and hard disks), and optically readable media (for example, CD-ROMs and DVDs).

The present invention has been described above with reference to its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the scope equivalent to the claims should be construed as being included in the present invention.

10 ... input module
20 ... control module
30 ... classification network
40 ... generation network
50 ... storage
60 ... output module

Claims

A hidden image inference method comprising:
inputting at least two pieces of data from among at least one image, at least one text, and at least one feature value;
generating, by a generation network, at least one image inferred from the input at least two pieces of data according to a correlation model between images built through learning;
determining, by a classification network, a suitability score of each generated inferred image; and
outputting each generated inferred image as a hidden image between the images represented by the input at least two pieces of data, based on a result of comparing the suitability score of the generated inferred image with a reference score.

The method of claim 1,
wherein the determining of the suitability score comprises determining the suitability score of each inferred image from the input at least two pieces of data.

3. The method of claim 2,
wherein the determining of the suitability score comprises determining the suitability score according to a similarity between (i) at least two pieces of data, from among at least one image, at least one text, and at least one feature value, generated by inputting each generated inferred image to the generation network, and (ii) the input at least two pieces of data.

The method of claim 1,
wherein the generating of the inferred image comprises:
converting at least two of the at least one image, the at least one text, and the at least one feature value into latent vectors;
calculating a latent vector located between the converted latent vectors by interpolating the converted latent vectors; and
generating at least one image inferred from the input at least two pieces of data by using the converted latent vectors and the calculated latent vector.
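For illustration only, the interpolation step recited above can be sketched as follows. The claim does not specify the interpolation scheme, so linear interpolation, the function name `interpolate_latents`, and the plain-list latent vectors are assumptions made for this sketch.

```python
def interpolate_latents(z_a, z_b, steps):
    """Return `steps` latent vectors lying strictly between z_a and z_b.

    Linear interpolation is assumed here; the endpoints themselves are
    not included in the result.
    """
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same dimension")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    between = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # interpolation fraction, strictly between 0 and 1
        between.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return between


# Latent vectors of two input images (toy values).
z_first = [0.0, 2.0]
z_second = [4.0, 0.0]

midpoint = interpolate_latents(z_first, z_second, 1)
three_between = interpolate_latents(z_first, z_second, 3)
```

Each interpolated latent vector would then be passed to the generation network to produce an image lying between the two inputs.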

The method of claim 1,
wherein the generating of the inferred image comprises:
converting each input image into a latent vector; and
inferring, from the viewpoint of each input image, features of the correlation between the images represented by the input at least two pieces of data by using the converted latent vector.

The method of claim 1,
wherein the generating of the inferred image comprises:
converting each input text into a latent vector; and
inferring, from the viewpoint of each input text, features of the correlation between the images represented by the input at least two pieces of data by using the converted latent vector.

The method of claim 1,
wherein the generating of the inferred image comprises:
converting each input feature value into a latent vector; and
inferring, from the viewpoint of each input feature value, features of the correlation between the images represented by the input at least two pieces of data by using the converted latent vector.

The method of claim 1,
wherein the generating of the inferred image comprises:
converting the input at least one image, at least one text, and at least one feature value into latent vectors;
inferring features of the correlation between the images represented by the input at least two pieces of data by using the converted latent vectors; and
integrating the inferred features and generating, according to the correlation model, at least one image inferred from the integrated features, thereby generating the at least one image inferred from the input at least two pieces of data.

A computer-readable recording medium on which a program for executing the method of any one of claims 1 to 8 on a computer is recorded.

A hidden image inference device comprising:
an input module for inputting at least two pieces of data from among at least one image, at least one text, and at least one feature value;
a generation network that generates at least one image inferred from the input at least two pieces of data according to a correlation model between images built through learning;
a classification network that determines a suitability score of each generated inferred image; and
an output module that outputs each generated inferred image as a hidden image between the images represented by the input at least two pieces of data, based on a result of comparing the suitability score of the generated inferred image with a reference score.