KR20220112655A

KR20220112655A - Apparatus and method for providing augmented reality-based video conference for multi-party online business collaboration

Info

Publication number: KR20220112655A
Application number: KR1020210117824A
Authority: KR
Inventors: 임지숙; 하태원
Original assignee: (주)스마트큐브
Priority date: 2021-02-04
Filing date: 2021-09-03
Publication date: 2022-08-11
Also published as: KR102472115B1

Abstract

The present invention provides an apparatus and a method for providing an augmented reality-based video conference for multi-party online business collaboration, which can give every user participating in the video conference a realistic user experience of an object. The method includes: generating, by a coordinate generating unit, a background coordinate vector expressed in three-dimensional coordinates from a background local image, which consists only of the background of a local image, by using a transformation model trained through deep learning, and generating an object coordinate vector expressed in three-dimensional coordinates from an object local image, which consists only of the object of the local image, by using the same transformation model; mapping, by an augmentation unit, the background coordinate vector to the background local image and the object coordinate vector to the object local image; generating, by the augmentation unit, an augmented image by matching the object local image mapped to the object coordinate vector onto the background local image mapped to the background coordinate vector according to the three-dimensional coordinates of the two coordinate vectors; and providing, by the augmentation unit, the augmented image to the user devices participating in the video conference.

Description

Apparatus and method for providing augmented reality-based video conference for multi-party online business collaboration

The present invention relates to a technology for providing a video conference and, more particularly, to an apparatus and a method for providing an augmented reality (AR)-based video conference for multi-party online business collaboration.

Augmented reality (AR) is a technology that fuses virtual objects and information created by computer technology with the real world in order to supplement it. It adds a virtual world carrying additional information to the real world in real time and shows the result as a single image.

Korean Patent Laid-Open Publication No. 2015-0099401 (published on August 31, 2015)

An object of the present invention is to provide an apparatus for providing an augmented reality-based video conference for multi-party online business collaboration, and a method therefor.

To achieve the above object, a method for providing an augmented reality-based video conference according to a preferred embodiment of the present invention includes: generating, by a coordinate generating unit, a background coordinate vector expressed in three-dimensional coordinates from a background local image, which consists only of the background of a local image, by using a transformation model trained through deep learning, and generating an object coordinate vector expressed in three-dimensional coordinates from an object local image, which consists only of the object of the local image, by using the transformation model; mapping, by an augmentation unit, the background coordinate vector to the background local image and the object coordinate vector to the object local image; generating, by the augmentation unit, an augmented image by matching the object local image mapped to the object coordinate vector onto the background local image mapped to the background coordinate vector according to the three-dimensional coordinates of the background coordinate vector and the object coordinate vector; and providing, by the augmentation unit, the augmented image to the user devices participating in the video conference.

The generating of the augmented image includes: when the augmentation unit receives an input for manipulating the position of the object, changing the three-dimensional coordinates of the object coordinate vector according to the received input; and generating, by the augmentation unit, the augmented image by matching the object local image onto the background local image according to the changed three-dimensional coordinates of the object coordinate vector.
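The manipulation step above can be illustrated with a minimal sketch. It is not the patented implementation: the data layout (one `(x, y, z)` tuple per object pixel, a dictionary of background cells with a depth value), the helper names `move_object` and `match_object_to_background`, and the rule that a smaller `z` means "closer to the viewer" are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]   # (x, y, z) in a shared 3D frame (assumed layout)
Pixel = Tuple[int, int, int]         # RGB colour
Cell = Tuple[int, int]               # integer (x, y) position in the augmented image


def move_object(object_coords: List[Coord], offset: Coord) -> List[Coord]:
    """Return a new object coordinate vector shifted by the user's offset."""
    dx, dy, dz = offset
    return [(x + dx, y + dy, z + dz) for (x, y, z) in object_coords]


def match_object_to_background(
    background: Dict[Cell, Tuple[float, Pixel]],
    object_pixels: List[Pixel],
    object_coords: List[Coord],
) -> Dict[Cell, Pixel]:
    """Overlay object pixels on the background wherever the object is nearer.

    `background` maps a cell to (depth, colour). An object pixel is drawn at
    (round(x), round(y)) only if its z is smaller than the background depth.
    """
    augmented = {cell: colour for cell, (_, colour) in background.items()}
    for colour, (x, y, z) in zip(object_pixels, object_coords):
        cell = (round(x), round(y))
        if cell in background and z < background[cell][0]:
            augmented[cell] = colour
    return augmented


# Hypothetical data: a 4x4 black background at depth 10 and a one-pixel red object.
background = {(x, y): (10.0, (0, 0, 0)) for x in range(4) for y in range(4)}
object_pixels = [(255, 0, 0)]
object_coords = [(0.0, 0.0, 5.0)]

# The user drags the object two cells right and one cell down.
moved = move_object(object_coords, (2.0, 1.0, 0.0))
augmented = match_object_to_background(background, object_pixels, moved)

# Pushing the object behind the background surface hides it.
hidden = match_object_to_background(
    background, object_pixels, move_object(object_coords, (0.0, 0.0, 10.0))
)
```

Keeping the move and the matching as separate functions mirrors the two sub-steps in the text: the coordinates are changed first, and the augmented image is regenerated from the changed coordinates.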

The method further includes, before the generating of the object coordinate vector, generating the background local image and the object local image by an image processing unit which, on receiving at least one image from at least one user device, separates the background and the object in the received image.

The generating of the object coordinate vector includes: when the coordinate generating unit inputs the background local image to the transformation model, generating the background coordinate vector by the transformation model performing operations in which weights between a plurality of layers are applied; and when the coordinate generating unit inputs the object local image to the transformation model, generating the object coordinate vector by the transformation model performing operations in which weights between a plurality of layers are applied.

The method further includes, before the converting into the background coordinate vector: preparing, by a learning unit, a plurality of training data each including a training local image and a measured coordinate vector consisting of three-dimensional coordinates actually measured for every pixel of the training local image; and generating, by the learning unit, a transformation model that includes an identification network and a transformation network, by using at least part of the plurality of training data to perform a first step and a second step alternately. The first step is an optimization that modifies the parameters of the identification network so that the identification network judges the measured coordinate vector to be a measured value and judges a training coordinate vector generated by the transformation network to be a simulated value imitating a measured value. The second step is an optimization that modifies the parameters of the transformation network so that the identification network judges the training coordinate vector generated by the transformation network to be a measured value.

The first step includes performing, by the learning unit, an optimization that modifies the weights of the identification network, without modifying the weights of the transformation network, so that the identification loss calculated by an identification loss function Lds(x) is maximized. Here, Lds(x) is the identification loss function; GT is the measured coordinate vector; x is the input to the identification network, which is either a training coordinate vector or a measured coordinate vector; and D(x) is the identification value obtained when the identification network performs, on x, a plurality of operations in which weights between a plurality of layers are applied.

The second step includes performing, by the learning unit, an optimization that modifies the weights of the transformation network, without modifying the weights of the identification network, so that the transformation loss calculated by a transformation loss function Ltn(z) is maximized. Here, Ltn(z) is the transformation loss function; z is the input to the transformation network, namely a training local image; G(z) is the training coordinate vector that the transformation network calculates from the training local image through a plurality of operations in which weights between a plurality of layers are applied; and D(G(z)) is the identification value obtained when the identification network performs, on the input G(z), a plurality of operations in which weights between a plurality of layers are applied.
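The alternating scheme can be sketched with a deliberately tiny toy model. Everything concrete here is an assumption made for illustration and is not taken from the source: the "images" and "coordinate vectors" are single numbers, the identification network is D(x) = sigmoid(w·x + b), the transformation network is G(z) = a·z + c, and the two objectives are the log-likelihood expressions commonly used in adversarial training (log D(GT) + log(1 − D(G(z))) for the first step, log D(G(z)) for the second). The point of the sketch is only the update pattern: each step performs gradient ascent on one network while the other is held fixed.

```python
import math
from typing import List, Tuple

Params = Tuple[float, float]


def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))


def identification_step(d: Params, g: Params, images: List[float],
                        measured: List[float], lr: float = 0.05) -> Params:
    """First step: update only the identification network (w, b); g is frozen."""
    w, b = d
    a, c = g
    grad_w = grad_b = 0.0
    for z, gt in zip(images, measured):
        fake = a * z + c                  # G(z), treated as a constant here
        p_real = sigmoid(w * gt + b)      # D(GT)
        p_fake = sigmoid(w * fake + b)    # D(G(z))
        # Gradient of log D(GT) + log(1 - D(G(z))) with respect to w and b.
        grad_w += (1.0 - p_real) * gt - p_fake * fake
        grad_b += (1.0 - p_real) - p_fake
    n = len(images)
    return (w + lr * grad_w / n, b + lr * grad_b / n)


def transformation_step(d: Params, g: Params, images: List[float],
                        lr: float = 0.05) -> Params:
    """Second step: update only the transformation network (a, c); d is frozen."""
    w, b = d
    a, c = g
    grad_a = grad_c = 0.0
    for z in images:
        p_fake = sigmoid(w * (a * z + c) + b)   # D(G(z))
        # Gradient of log D(G(z)) with respect to a and c.
        grad_a += (1.0 - p_fake) * w * z
        grad_c += (1.0 - p_fake) * w
    n = len(images)
    return (a + lr * grad_a / n, c + lr * grad_c / n)


def train(images: List[float], measured: List[float],
          rounds: int = 200) -> Tuple[Params, Params]:
    """Alternate the two steps for a fixed number of rounds."""
    d: Params = (0.1, 0.0)
    g: Params = (0.0, 0.0)
    for _ in range(rounds):
        d = identification_step(d, g, images, measured)
        g = transformation_step(d, g, images)
    return d, g


# Hypothetical toy data: each "image" z has a "measured coordinate" 2z + 1.
toy_images = [0.0, 0.5, 1.0]
toy_measured = [1.0, 2.0, 3.0]
d_final, g_final = train(toy_images, toy_measured)
```

Because each step function returns only the parameters of the network it is allowed to change, the frozen network cannot be modified by accident, which is the property the two steps in the text rely on.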

To achieve the above object, an apparatus for providing an augmented reality-based video conference according to a preferred embodiment of the present invention includes: a coordinate generating unit that generates a background coordinate vector expressed in three-dimensional coordinates from a background local image, which consists only of the background of a local image, by using a transformation model trained through deep learning, and generates an object coordinate vector expressed in three-dimensional coordinates from an object local image, which consists only of the object of the local image, by using the transformation model; and an augmentation unit that maps the background coordinate vector to the background local image, maps the object coordinate vector to the object local image, generates an augmented image by matching the object local image mapped to the object coordinate vector onto the background local image mapped to the background coordinate vector according to the three-dimensional coordinates of the background coordinate vector and the object coordinate vector, and provides the augmented image to the user devices participating in the video conference.

According to the present invention, each user participating in the video conference can individually manipulate and test the same object in an augmented reality in which that user's own site is reflected as the background. Accordingly, every user participating in the video conference can be given a realistic user experience of the object.

FIG. 1 is a diagram for explaining the configuration of a system for providing an augmented reality-based video conference according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the configuration of a user device for providing a video conference according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the configuration of a video conference server for providing an augmented reality-based video conference according to an embodiment of the present invention.
FIG. 4 is a block diagram for explaining the detailed configuration of a control module for providing augmented reality according to an embodiment of the present invention.
FIG. 5 is a screen example for explaining a method of generating a background local image and an object local image according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining the configuration of a transformation model for providing augmented reality according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining a method of matching an object local image to a background local image according to an embodiment of the present invention.
FIG. 8 is a flowchart for explaining a method of generating a transformation model according to an embodiment of the present invention.
FIG. 9 is a flowchart for explaining a method of optimizing the identification network of a transformation model according to an embodiment of the present invention.
FIG. 10 is a flowchart for explaining a method of optimizing the transformation network of a transformation model according to an embodiment of the present invention.
FIG. 11 is a flowchart for explaining a method for providing an augmented reality-based video conference according to an embodiment of the present invention.

Before the detailed description of the present invention, it should be noted that the terms and words used in this specification and the claims are not to be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, based on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing this application.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Note that the same components are denoted by the same reference numerals wherever possible in the drawings. Detailed descriptions of well-known functions and configurations that may obscure the gist of the present invention are omitted. For the same reason, some components in the accompanying drawings are exaggerated, omitted, or shown schematically, and the size of each component does not fully reflect its actual size.

First, a system for providing an augmented reality (AR)-based video conference for multi-party online business collaboration according to an embodiment of the present invention is described. FIG. 1 is a diagram for explaining the configuration of a system for providing an augmented reality-based video conference according to an embodiment of the present invention.

Referring to FIG. 1, the system for providing an augmented reality-based video conference according to an embodiment of the present invention (hereinafter abbreviated as the 'video conference system') includes a user device 10 and a video conference server 20.

The user device 10 is a device that has a camera function and a communication function. It is used by a user participating in a video conference and can transmit the images it captures to the video conference server 20.

The video conference server 20 basically connects all of the user devices 10 participating in a video conference so that they can hold the conference. In particular, the video conference server 20 can establish a session with every user device 10 participating in the video conference and provide the video conference through those sessions.

From at least one image received from at least one user device 10, the video conference server 20 can generate a background local image, which is an image of the extracted background, and an object local image, which is an image of the extracted object. It can then generate an augmented image in which the object is matched to, and displayed at, the position in the background that the user chooses by manipulating the object. The video conference server 20 can provide the generated augmented image to all user devices 10 participating in the video conference. To generate the augmented image, the video conference server 20 can use a transformation model (TM) generated by deep learning.

Next, the user device 10 for providing a video conference according to an embodiment of the present invention is described. FIG. 2 is a diagram for explaining the configuration of a user device for providing a video conference according to an embodiment of the present invention. Referring to FIG. 2, the user device 10 according to an embodiment of the present invention includes a communication unit 11, a camera unit 12, a sensor unit 13, an audio unit 14, an input unit 14, a display unit 15, a storage unit 16, and a control unit 17.

The communication unit 11 communicates with the video conference server 20. It may include a radio frequency (RF) transmitter (Tx) that up-converts and amplifies the frequency of a signal to be transmitted, and an RF receiver (Rx) that low-noise amplifies a received signal and down-converts its frequency. The communication unit 11 may also include a modem that modulates a signal to be transmitted and demodulates a received signal. Under the control of the control unit 17, the communication unit 11 can transmit an image containing a background and an object to the video conference server 20. It can also receive augmented reality images and the like from the video conference server 20.

The camera unit 12 captures images. It may include a lens and an image sensor. Each image sensor receives light reflected from a subject and converts it into an electrical signal. The image sensor may be implemented based on a charge-coupled device (CCD), a complementary metal-oxide semiconductor (CMOS), or the like. The camera unit 12 may further include one or more analog-to-digital converters, and can convert the electrical signal output from the image sensor into a digital sequence and output it to the control unit 17.

The sensor unit 13 measures inertia. It includes an inertial measurement unit (IMU), a Doppler velocity log (DVL), an attitude and heading reference system (AHRS), and the like. The sensor unit 13 measures inertial information, which includes the position in three-dimensional coordinates and the Euler angles of the camera unit 12 of the user device 10, and provides the measured inertial information of the user device 10 to the control unit 17.
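As an illustration of what a position plus Euler angles describes, the following sketch turns such a pose into a rotation matrix and uses it to express a camera-frame point in a fixed frame. The source does not say how the inertial information is consumed, so this usage, the Z-Y-X (yaw, pitch, roll) convention, and the function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def rotation_from_euler(roll: float, pitch: float, yaw: float) -> Matrix:
    """Return R = Rz(yaw) * Ry(pitch) * Rx(roll); angles are in radians."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def camera_to_world(point: Sequence[float], position: Sequence[float],
                    roll: float, pitch: float, yaw: float) -> Tuple[float, ...]:
    """Rotate a camera-frame point by the camera attitude, then add its position."""
    r = rotation_from_euler(roll, pitch, yaw)
    return tuple(
        sum(r[i][j] * point[j] for j in range(3)) + position[i] for i in range(3)
    )


# Hypothetical pose: camera at (1, 2, 3), turned 90 degrees about the vertical axis.
world_point = camera_to_world((1.0, 0.0, 0.0), (1.0, 2.0, 3.0), 0.0, 0.0, math.pi / 2)
```

With this pose, the camera's forward x-axis points along the fixed frame's y-axis, so the point one unit ahead of the camera lands one unit further along y from the camera position.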

The input unit 14 receives the user's key operations for controlling the user device 10, generates an input signal, and delivers it to the control unit 17. The input unit 14 may include various keys for controlling the user device 10. When the display unit 15 is a touch screen, the functions of these keys can be performed on the display unit 15, and if the touch screen alone can perform all of the functions, the input unit 14 may be omitted.

The display unit 15 visually presents the menus of the user device 10, input data, function setting information, and various other information to the user. It outputs screens such as the boot screen, standby screen, and menu screen of the user device 10. In particular, the display unit 15 outputs the augmented reality image according to an embodiment of the present invention on the screen. The display unit 15 may be formed of a liquid crystal display (LCD), organic light-emitting diodes (OLED), active-matrix organic light-emitting diodes (AMOLED), or the like. The display unit 15 may also be implemented as a touch screen, in which case it includes a touch sensor that detects the user's touch input. The touch sensor may be a touch-sensing sensor of the capacitive overlay, pressure, resistive overlay, or infrared beam type, or may be a pressure sensor. Besides these, any kind of sensor device capable of detecting the contact or pressure of an object can be used as the touch sensor of the present invention. The touch sensor can detect the user's touch input, generate a detection signal including input coordinates that indicate the touched position, and transmit the signal to the control unit 17. In particular, when the display unit 15 is a touch screen, some or all of the functions of the input unit 14 can be performed through the display unit 15.

The storage unit 16 stores the programs and data needed for the operation of the user device 10. In particular, the storage unit 16 can store camera parameters and the like. The data stored in the storage unit 16 can be deleted, changed, or added according to operations by the user of the user device 10.

The control unit 17 controls the overall operation of the user device 10 and the signal flow between its internal blocks, and can perform a data processing function. The control unit 17 also basically controls the various functions of the user device 10. Examples of the control unit 17 include a central processing unit (CPU), a baseband processor (BP), an application processor (AP), a graphics processing unit (GPU), and a digital signal processor (DSP).

The control unit 17 runs an application for web browser-based video conferencing and connects to the video conference server 20 through that application. While connected to the video conference server 20, the control unit 17 can, according to the user's operation, transmit an image captured by the camera unit 12 or a previously stored image to the video conference server 20 through the communication unit 11. The control unit 17 can also receive an augmented reality image from the video conference server 20, in which case it displays the augmented image through the display unit 15. In particular, the control unit 17 preferably detects an input that manipulates the position of an object through a touch input on the object in the augmented image, and can transmit this input (for example, input coordinates indicating the touched position) to the video conference server 20 through the communication unit 11.

Next, the video conference server 20 for providing augmented reality according to an embodiment of the present invention is described. FIG. 3 is a diagram for explaining the configuration of a video conference server for providing an augmented reality-based video conference according to an embodiment of the present invention. Referring to FIG. 3, the video conference server 20 according to an embodiment of the present invention includes a communication module 21, a storage module 22, and a control module 23.

The communication module 21 communicates with the user device 10 over a network and can exchange data with it. The communication module 21 may include a radio frequency (RF) transmitter (Tx) that up-converts and amplifies the frequency of a signal to be transmitted, and an RF receiver (Rx) that low-noise amplifies a received signal and down-converts its frequency. To transmit and receive data, the communication module 21 may also include a modem that modulates a signal to be transmitted and demodulates a received signal. The communication module 21 can transmit data received from the control module 23, for example an augmented image, to the user device 10. It can also receive an image containing an object and a background from the user device 10 and deliver the received image to the control module 23.

The storage module 22 stores the programs and data needed for the operation of the video conference server 20. It can store local images, which include object local images and background local images, and coordinate vectors, which include background coordinate vectors and object coordinate vectors. The data stored in the storage module 22 can be registered, deleted, changed, or added according to operations by the administrator of the video conference server 20.

The control module 23 controls the overall operation of the video conference server 20 and the signal flow between its internal blocks, and can perform a data processing function. The control module 23 may be a central processing unit, a digital signal processor, or the like. The control module 23 may additionally include an image processor or a graphics processing unit (GPU).

The detailed configuration of the control module 23 for providing augmented reality will now be described. FIG. 4 is a block diagram illustrating the detailed configuration of a control module for providing augmented reality according to an embodiment of the present invention. FIG. 5 is an example screen illustrating a method of generating a background local image and an object local image according to an embodiment of the present invention. FIG. 6 is a diagram illustrating the configuration of a transformation model for providing augmented reality according to an embodiment of the present invention. FIG. 7 is a diagram illustrating a method of matching an object local image to a background local image according to an embodiment of the present invention.

First, referring to FIG. 4, the control module 23 according to an embodiment of the present invention includes a learning unit 100, an image processing unit 200, a coordinate generating unit 300, and an augmentation unit 400.

When the image processing unit 200 receives an image from the user device 10, it generates a local image from that image. The local image includes a background local image consisting only of the background and an object local image consisting only of the object. For example, the user device 10 may transmit an image such as that shown in FIG. 5(A). The image processing unit 200 may then separate the background and the object to generate a background local image as shown in FIG. 5(B) and an object local image as shown in FIG. 5(C). Methods for separating the background and the object include, for example, approximated median filtering (AMF), the Gaussian mixture model, the adaptive Gaussian mixture model, the eigen-background model, and background subtraction. The present invention is not limited to these, and the methods may be used alone or in combination.

The image processing unit 200 prepares the local image from at least one image provided by at least one user device 10. Here, the local image includes a background local image consisting only of the background and an object local image consisting only of the object. As one example, the first user device 11 may provide a first image to the video conference server 20 for both the background local image and the object local image. The image processing unit 200 of the video conference server 20 may then separate the background and the object from the first image to generate the background local image and the object local image. As another example, the first user device 11 may provide the video conference server 20 with a first image for the background local image and a second image for the object local image. The image processing unit 200 may then separate the background from the first image to generate the background local image, and separate the object from the second image to generate the object local image. As yet another example, the first user device 11 may provide the first image for the background local image to the video conference server 20, and the second user device 12 may provide the second image for the object local image to the video conference server 20.
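As an illustration of the background/object separation described above, the following is a minimal sketch of approximated median filtering on small grayscale frames. The function names, the `threshold` and `step` values, and the choice to fill object pixels in the background image from the background estimate are assumptions made for this example; the specification does not fix any of them.

```python
def update_background(background, frame, step=1):
    """Approximated median filter: nudge each background pixel toward the frame."""
    return [
        [b + step if f > b else b - step if f < b else b
         for b, f in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]


def split_frame(background, frame, threshold=30, empty=0):
    """Return (background_local, object_local) for one grayscale frame."""
    background_local, object_local = [], []
    for bg_row, fr_row in zip(background, frame):
        bg_out, obj_out = [], []
        for b, f in zip(bg_row, fr_row):
            # A pixel far from the background estimate is treated as object.
            is_object = abs(f - b) > threshold
            obj_out.append(f if is_object else empty)
            # The hole left by an object pixel is filled from the estimate.
            bg_out.append(b if is_object else f)
        background_local.append(bg_out)
        object_local.append(obj_out)
    return background_local, object_local


# Hypothetical 2x2 frame: one bright pixel stands out from the background.
background = [[100, 100], [100, 100]]
frame = [[101, 100], [200, 99]]
bg_img, obj_img = split_frame(background, frame)
updated = update_background(background, frame)
```

In this toy input only the pixel with value 200 ends up in the object local image; the other pixels stay in the background local image, and the running background estimate moves by one step toward the new frame.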

The learning unit 100 generates the transformation model TM through deep learning. Specifically, the learning unit 100 uses training data to train the transformation model TM so that, when a background local image consisting only of the background is input, the model generates from it a background coordinate vector in which the background is expressed in three-dimensional coordinates, and, when an object local image consisting only of the object is input, the model generates from it an object coordinate vector in which the object is expressed in three-dimensional coordinates. This training method is described in more detail below.

Referring to FIG. 6, the transformation model TM includes a transformative network (TN) and a discriminative network (DS).

The transformative network TN includes an encoder EN and a decoder DE. The transformative network TN, comprising the encoder EN and the decoder DE, includes a plurality of layers that perform a plurality of weighted operations. The plurality of layers includes at least one each of a convolution layer (CL) that performs a convolution operation, a pooling layer (PL) that performs a down-sampling operation, an unpooling layer (UL) that performs an up-sampling operation, and a deconvolution layer (DL) that performs a deconvolution operation. Each of the convolution, down-sampling, up-sampling, and deconvolution operations uses a filter (kernel) in the form of a matrix, and the values of the matrix elements serve as the weights.
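To make the down-sampling and up-sampling roles of the pooling and unpooling layers concrete, the sketch below pairs a 2x2 max pooling step with an unpooling step that restores each maximum to its original position. Max pooling with remembered positions is only one common choice and is an assumption here, as are the function names; the specification does not state which pooling variant the encoder EN and decoder DE use.

```python
def max_pool_2x2(feature_map):
    """Downsample by 2, remembering where each maximum came from."""
    pooled, switches = [], []
    for r in range(0, len(feature_map), 2):
        pooled_row, switch_row = [], []
        for c in range(0, len(feature_map[0]), 2):
            window = [(feature_map[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window)
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool_2x2(pooled, switches, height, width):
    """Upsample: put each pooled value back at its recorded position."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            out[r][c] = value
    return out


# Hypothetical 4x4 feature map (even height and width are assumed).
fm = [[1, 3, 2, 0],
      [4, 2, 1, 5],
      [0, 1, 7, 2],
      [2, 6, 3, 1]]
pooled, switches = max_pool_2x2(fm)
restored = unpool_2x2(pooled, switches, 4, 4)
```

The encoder side halves the spatial resolution, and the decoder side returns to the input resolution, which is what lets the network emit one coordinate per input pixel.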

When a local image, that is, an object local image or a background local image, is input, the transformative network TN performs a plurality of operations on it, applying the weights between the plurality of layers, and generates a coordinate vector representing the three-dimensional coordinates corresponding to each pixel of the local image. In other words, when an object local image is input, the network generates an object coordinate vector representing the three-dimensional coordinates of the pixels constituting the object, and when a background local image is input, it generates a background coordinate vector representing the three-dimensional coordinates of the pixels constituting the background.

The discriminative network DS includes a plurality of layers that perform a plurality of weighted operations. The plurality of layers includes an input layer (IL), a convolution layer (CL) that performs a convolution operation and an activation-function operation, a pooling layer (PL) that performs a pooling (sub-sampling) operation, a fully connected layer (FL) that performs an activation-function operation, and an output layer (OL) that performs an activation-function operation. There may be two or more of each of the convolution layer CL, the pooling layer PL, and the fully connected layer FL. The convolution layer CL and the pooling layer PL each consist of at least one feature map (FM). A feature map FM receives the values obtained by applying a weight W to the operation result of the previous layer and is derived as the result of performing an operation on those values. The weight W is applied through a filter, or kernel, which is a weight matrix of a predetermined size.

Examples of activation functions used in the convolution layer CL, the fully connected layer FL, and the output layer OL include sigmoid, hyperbolic tangent (tanh), exponential linear unit (ELU), rectified linear unit (ReLU), Leaky ReLU, Maxout, Minout, and Softmax. Any one of these activation functions may be selected and applied to the convolution layer CL, the fully connected layer FL, and the output layer OL.
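For reference, several of the activation functions listed above can be written in a few lines using their standard definitions. The default `slope` and `alpha` values are conventional choices assumed for this sketch, not values taken from the specification; the hyperbolic tangent is available directly as `math.tanh`.

```python
import math


def sigmoid(x):
    """Logistic function; maps any real input to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """ReLU with a small slope for negative inputs."""
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def softmax(values):
    """Normalize a list of scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

A sigmoid on the output layer is a natural fit for the discriminative network, since its output is interpreted as a probability.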

The discriminative network DS is used only during training. When either a measured (ground-truth) coordinate vector prepared for training or a coordinate vector generated by the transformative network TN is input, the discriminative network DS performs a plurality of operations on the input coordinate vector, applying the weights between the plurality of layers, and computes a discrimination value that expresses as a probability whether the input coordinate vector is a measured value or a simulated value that imitates a measured value.

Referring again to FIG. 4, the coordinate generating unit 300 generates a coordinate vector from a local image using the transformation model TM trained through deep learning. That is, when the coordinate generating unit 300 inputs a background local image to the transformation model TM, the transformation model TM generates a background coordinate vector from the background local image. Likewise, when the coordinate generating unit 300 inputs an object local image to the transformation model TM, the transformation model TM generates an object coordinate vector from the object local image.

When the coordinate generating unit 300 has generated the background coordinate vector and the object coordinate vector, the augmentation unit 400 maps the background coordinate vector to the background local image and maps the object coordinate vector to the object local image. The augmentation unit 400 then generates an augmented image by matching the object local image mapped to the object coordinate vector, as shown in FIG. 7(B), onto the background local image mapped to the background coordinate vector, as shown in FIG. 7(A), according to the three-dimensional coordinates of the two vectors. If, as shown in FIG. 7(B), an input IN for manipulating the position of the object is received from the user device 10, the augmentation unit 400 may change the three-dimensional coordinates of the object coordinate vector according to the received input IN and, taking the three-dimensional coordinates of the background coordinate vector as the reference, match the object local image onto the background local image according to the changed coordinates. Once the augmented image has been generated, the augmentation unit 400 may transmit it through the communication module 21 to all user devices 10 participating in the video conference.
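The matching step described above can be pictured with a small depth-buffer sketch. It assumes that each local image is represented as a list of (color, (x, y, z)) pairs, that the augmented image is formed by an orthographic projection onto the x-y plane, and that a smaller z is nearer to the viewer. These representations, the `compose` function, and the `offset` parameter standing in for the user input IN are all illustrative assumptions rather than details given in the specification.

```python
def compose(background, obj, offset=(0, 0, 0), width=2, height=2):
    """Overlay object pixels on background pixels using their 3D coordinates.

    background, obj: lists of (color, (x, y, z)) pixel/coordinate pairs.
    offset: (dx, dy, dz) shift applied to the object, e.g. from user input.
    """
    dx, dy, dz = offset
    moved = [(color, (x + dx, y + dy, z + dz)) for color, (x, y, z) in obj]
    canvas = [[None] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for color, (x, y, z) in background + moved:
        col, row = round(x), round(y)
        # Keep the pixel only if it lands on the canvas and is the nearest so far.
        if 0 <= row < height and 0 <= col < width and z < depth[row][col]:
            canvas[row][col] = color
            depth[row][col] = z
    return canvas


# Hypothetical data: a flat 2x2 background at depth 10 and a one-pixel object.
background = [("B", (x, y, 10)) for y in range(2) for x in range(2)]
obj = [("O", (0, 0, 5))]

in_front = compose(background, obj)                  # object at its own position
shifted = compose(background, obj, offset=(1, 0, 0)) # user moves it right
hidden = compose(background, obj, offset=(0, 0, 10)) # pushed behind the wall
```

Because the background coordinates serve as the fixed reference, moving the object only requires changing the object coordinates and compositing again.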

As described above, the present invention can provide augmented reality in a video conference. To do so, the transformation model must first be generated through deep learning. A method of generating the transformation model according to an embodiment of the present invention is described next. FIG. 8 is a flowchart illustrating a method of generating a transformation model according to an embodiment of the present invention. FIG. 9 is a flowchart illustrating a method of optimizing the discriminative network of the transformation model according to an embodiment of the present invention. FIG. 10 is a flowchart illustrating a method of optimizing the transformative network of the transformation model according to an embodiment of the present invention.

Referring to FIGS. 6 and 8, the learning unit 100 prepares a plurality of training data in step S110. Each item of training data includes a training local image, which is a background local image or an object local image extracted from an image captured by a camera for training purposes, and a measured coordinate vector consisting of the three-dimensional coordinates actually measured for every pixel of the training local image.

Next, in step S120, the learning unit 100 trains the discriminative network DS using at least some of the plurality of training data. Here, the learning unit 100 performs an optimization that modifies the parameters of the discriminative network DS so that the network judges a measured coordinate vector GT to be a measured value and judges a training coordinate vector generated by the transformative network TN to be a simulated value imitating a measured value.

Step S120 is described in more detail with reference to FIG. 9. Referring to FIG. 9, the learning unit 100 inputs a training local image to the transformative network TN in step S210. In step S220, the encoder EN and the decoder DE of the transformative network TN then perform operations on the input training local image and camera parameters, applying the weights between the plurality of layers, and sequentially compute a training hidden vector and a training coordinate vector.

In step S230, the learning unit 100 inputs either the training coordinate vector or the measured coordinate vector GT to the discriminative network DS. The training coordinate vector is the one generated earlier (S220) from the training local image by the transformative network TN.

When the training coordinate vector or the measured coordinate vector GT is input in this way, the discriminative network DS performs, in step S240, a plurality of operations on the input, applying the weights between the plurality of layers, and computes a discrimination value D(x). The discrimination value represents the probability that the input, whether a training coordinate vector or a measured coordinate vector GT, is a measured coordinate vector GT.

Once the discrimination value has been computed, the learning unit 100 performs, in step S250, an optimization that modifies the weights of the discriminative network DS, while leaving the weights of the transformative network TN unchanged, so that the discrimination loss computed by the discrimination loss function is maximized. The discrimination loss function is given by Equation 1 below.

Here, Lds(x) denotes the discrimination loss function for training the discriminative network DS. GT is the measured coordinate vector, and x denotes the input to the discriminative network DS, which is either a training coordinate vector or a measured coordinate vector GT. D(x) denotes the discrimination value, which is the result of the discriminative network DS performing a plurality of operations on the input x with the weights between the plurality of layers applied. That is, when training the discriminative network DS, the learning unit 100 trains it to judge a measured coordinate vector GT to be a measured value and to judge a training coordinate vector generated by the transformative network TN to be a simulated value imitating a measured value. In other words, if the input x is a measured coordinate vector GT included in the training data, the learning unit 100 maximizes the discrimination value D(x) so that the discriminative network DS assigns a high probability to x being a measured value. Conversely, if the input x is not in the training data but is a training coordinate vector produced by the transformative network TN, the learning unit 100 maximizes 1-D(x) so that the discriminative network DS assigns a high probability to x being a simulated value.
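Equation 1 is published as an image and does not appear in this text. One form consistent with the description above, assuming the standard adversarial (GAN) discriminator objective with logarithms, would be the following; the logarithms and the use of G(z) for the training coordinate vector (notation introduced with Equation 2) are assumptions, and the published equation may differ in notation.

```latex
L_{ds}(x) = \log D(GT) + \log\bigl(1 - D(G(z))\bigr)
```

Maximizing the first term raises D(x) when x is the measured coordinate vector GT, and maximizing the second term raises 1-D(x) when x is a training coordinate vector produced by the transformative network TN, which matches the two cases described above.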

Referring again to FIG. 8, in step S130 the learning unit 100 trains the transformative network TN using at least some of the plurality of training data. Here, the learning unit 100 performs an optimization that modifies the parameters, that is, the weights, of the transformative network TN so that the discriminative network DS judges the training coordinate vector generated by the transformative network TN to be a measured value.

Step S130 is described in more detail with reference to FIG. 10. Referring to FIG. 10, the learning unit 100 inputs a training local image to the transformative network TN in step S310. In step S320, the encoder EN and the decoder DE of the transformative network TN then perform operations on the input training local image, applying the weights between the plurality of layers, and compute a training coordinate vector.

In step S330, the learning unit 100 inputs the training coordinate vector computed earlier (S320) to the discriminative network DS. When the training coordinate vector or the measured coordinate vector GT is input in this way, the discriminative network DS performs, in step S340, a plurality of operations on the input, applying the weights between the plurality of layers, and computes a discrimination value. Here, the discrimination value represents the probability that the input training coordinate vector is a measured value.

Once the discrimination value has been computed, the learning unit 100 performs, in step S350, an optimization that modifies the weights of the transformative network TN, while leaving the weights of the discriminative network DS unchanged, so that the transformation loss is maximized. The transformation loss is computed by a transformation loss function that indicates the degree to which the training coordinate vector produced by the transformative network TN is taken to be a measured value. The transformation loss function is given by Equation 2 below.

Here, Ltn(z) denotes the transformation loss function for training the transformative network TN, and z denotes the input to the transformative network TN, which is a training local image. G(z) is the training coordinate vector that the transformative network TN computes from the training local image through a plurality of operations with the weights between the plurality of layers applied. D(G(z)) is the discrimination value, which is the result of the discriminative network DS performing a plurality of operations on the input G(z) with the weights between the plurality of layers applied. That is, when training the transformative network TN, the learning unit 100 trains it so that the discriminative network DS judges the training coordinate vector G(z) generated by the transformative network TN to be a measured value. In other words, when the input z to the transformative network TN is a training local image included in the training data, the learning unit 100 trains in the direction that maximizes the discrimination value D(G(z)), so that the discriminative network DS assigns a high probability to the training coordinate vector G(z) being a measured value.
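Equation 2 is likewise published as an image and does not appear in this text. A form consistent with the description above, again assuming the logarithmic form commonly used for adversarial training, would be the following; the published equation may differ in notation.

```latex
L_{tn}(z) = \log D\bigl(G(z)\bigr)
```

Maximizing this quantity with the weights of the discriminative network DS held fixed pushes D(G(z)) toward 1, that is, toward the training coordinate vector being judged a measured value.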

Referring again to FIG. 8, in step S140 the learning unit 100 determines whether a training completion condition is satisfied. The learning unit 100 runs the entire transformation model TM on an evaluation data set taken from the plurality of training data. If the discrimination value that the discriminative network DS produces for the training coordinate vectors generated by the transformative network TN stays within a preset target range without fluctuating, the learning unit 100 may determine that the training completion condition is satisfied.

If the determination in step S140 is that the training completion condition is not satisfied, the learning unit 100 repeats steps S120 and S130. If the condition is satisfied, training ends in step S150. This completes the transformation model TM with its learned parameters, that is, its weights.

According to a further embodiment, when repeating steps S120 and S130, the learning unit 100 may use a different number of training data for training the discriminative network DS than for training the transformative network TN. For example, assume that the target range is 0.49 (49%) to 0.51 (51%). The final goal of the learning unit 100 is then for the discrimination value that the discriminative network DS produces for the training coordinate vectors generated by the transformative network TN to lie within the target range of 0.49 to 0.51 and to remain there without fluctuation.
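The completion condition of step S140 with the example target range above could be checked along the following lines. The function name, the idea of inspecting a window of recent evaluation results, and the `tolerance` used to decide that the value has stopped fluctuating are assumptions for this sketch; the specification only states the target range and the requirement that the value not change.

```python
def training_complete(recent_values, low=0.49, high=0.51, tolerance=0.005):
    """Return True if recent discrimination values are in range and stable.

    recent_values: discrimination values D(G(z)) from the latest evaluations.
    tolerance: largest spread still treated as "no fluctuation" (assumed).
    """
    if not recent_values:
        return False
    in_range = all(low <= v <= high for v in recent_values)
    stable = max(recent_values) - min(recent_values) <= tolerance
    return in_range and stable


done = training_complete([0.500, 0.501, 0.499])
still_moving = training_complete([0.49, 0.51])
out_of_range = training_complete([0.50, 0.62])
```

A value near 0.5 means the discriminative network can no longer tell generated coordinate vectors from measured ones, which is why the target range is centered there.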

In particular, training is performed by gradient ascent, that is, so that the loss given by each loss function is maximized, and the discriminative network DS is trained first. If the gradient of the discriminative network DS rises rapidly before the transformative network TN has been trained, there is no room left for the gradient of the transformative network TN to rise. The learning unit 100 may therefore assign fewer training data to the discriminative network DS than to the transformative network TN when repeating steps S120 and S130.
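One way to picture the unequal allocation described above is a scheduler that hands each round of steps S120 and S130 a smaller batch for the discriminative network DS and a larger one for the transformative network TN. The function name, the disjoint split of the samples, and the 1:4 ratio in the usage lines are illustrative assumptions; the specification states only that the DS receives fewer training data than the TN.

```python
def make_schedule(samples, ds_count, tn_count):
    """Yield (ds_batch, tn_batch) pairs, one pair per S120/S130 round.

    ds_count: number of samples used to train the discriminative network.
    tn_count: number of samples used to train the transformative network.
    """
    round_size = ds_count + tn_count
    # Leftover samples that cannot fill a whole round are skipped.
    for start in range(0, len(samples) - round_size + 1, round_size):
        ds_batch = samples[start:start + ds_count]
        tn_batch = samples[start + ds_count:start + round_size]
        yield ds_batch, tn_batch


# Hypothetical training set of ten sample identifiers, split 1:4 per round.
rounds = list(make_schedule(list(range(10)), ds_count=1, tn_count=4))
```

With ten samples this yields two rounds, each giving one sample to the DS update and four to the TN update.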

Once training is complete as described above, the trained transformation model TM can be used to provide augmented reality during a video conference. This method is described next. FIG. 11 is a flowchart illustrating a method for providing an augmented reality-based video conference according to an embodiment of the present invention.

Referring to FIG. 11, it is first assumed that sessions are established between the video conference server 20 and the plurality of user devices 10 participating in the video conference.

In step S410, the image processing unit 200 prepares a local image from at least one image provided by at least one user device 10. Here, the local image includes a background local image consisting only of the background and an object local image consisting only of the object. As one example, the first user device 11 may provide a first image to the video conference server 20 for both the background local image and the object local image. The image processing unit 200 of the video conference server 20 may then separate the background and the object from the first image to generate the background local image and the object local image. As another example, the first user device 11 may provide the video conference server 20 with a first image for the background local image and a second image for the object local image. The image processing unit 200 may then separate the background from the first image to generate the background local image, and separate the object from the second image to generate the object local image. As yet another example, the first user device 11 may provide the first image for the background local image to the video conference server 20, and the second user device 12 may provide the second image for the object local image to the video conference server 20. The image processing unit 200 may then separate the background from the first image to generate the background local image, and separate the object from the second image to generate the object local image. The background local image and the object local image prepared in this way by the image processing unit 200 are provided to the coordinate generating unit 300.

In step S420, the coordinate generating unit 300 derives a coordinate vector for each local image, that is, for the background local image and for the object local image, using the transformation model TM trained through deep learning as described above with reference to FIGS. 8 to 10.

That is, the coordinate generator 300 uses the transformation model TM to convert the background local image into a background coordinate vector in which the pixels constituting the background are expressed in three-dimensional coordinates. Likewise, the coordinate generator 300 uses the transformation model TM to convert the object local image into an object coordinate vector in which the pixels constituting the object are expressed in three-dimensional coordinates.

In step S420, when the coordinate generator 300 inputs each of the background local image and the object local image into the transformation model TM, the transformation network TN of the transformation model TM performs a plurality of operations, to which the learned weights between a plurality of layers are applied, and thereby generates the background coordinate vector and the object coordinate vector, respectively.
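The layer-by-layer computation in step S420 can be sketched as a small forward pass. This is an illustrative sketch only: the source does not disclose the architecture of the transformation network TN, so the per-pixel two-layer network, the ReLU activation, and the random weights standing in for learned weights are all assumptions, and `transformation_network` is a hypothetical name.

```python
import numpy as np

def transformation_network(local_image, weights):
    """Map an H x W x 3 image to an H x W x 3 coordinate vector (x, y, z per pixel).

    weights is a list of (weight_matrix, bias) pairs, one per layer. In the
    source these would be learned; here they are placeholders.
    """
    h, w, c = local_image.shape
    activations = local_image.reshape(h * w, c).astype(np.float64) / 255.0
    for layer_index, (weight, bias) in enumerate(weights):
        activations = activations @ weight + bias           # weighted operation between layers
        if layer_index < len(weights) - 1:
            activations = np.maximum(activations, 0.0)      # ReLU on hidden layers (assumed)
    return activations.reshape(h, w, 3)

rng = np.random.default_rng(0)
# Random stand-ins for the learned weights: 3 -> 8 -> 3 units per pixel.
weights = [
    (rng.normal(size=(3, 8)), np.zeros(8)),
    (rng.normal(size=(8, 3)), np.zeros(3)),
]
background_local = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)
background_coords = transformation_network(background_local, weights)
```

The same function would be called once more with the object local image to obtain the object coordinate vector, mirroring the way the source applies one transformation model to both inputs.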

When the background coordinate vector and the object coordinate vector have been generated, the augmentation unit 400 maps the background coordinate vector to the background local image and maps the object coordinate vector to the object local image in step S430. Next, in step S440, the augmentation unit 400 generates an augmented image by matching the object local image mapped to the object coordinate vector to the background local image mapped to the background coordinate vector, according to the three-dimensional coordinates of the background coordinate vector and the object coordinate vector. At this time, as shown in FIG. 7, when an input for manipulating the position of the object is received from the user device 10, the augmentation unit 400 may change the three-dimensional coordinates of the object coordinate vector according to the received input and, taking the three-dimensional coordinates of the background coordinate vector as the reference, match the object local image to the background local image according to the changed three-dimensional coordinates of the object coordinate vector. When the augmented image has been generated in this way, the augmentation unit 400 transmits the augmented image to all user devices 10 participating in the video conference through the communication unit 11 in step S440.
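The matching in step S440, including the effect of a position-manipulation input, can be sketched as follows. The sketch rests on assumptions not stated in the source: the x and y components of each coordinate are treated as pixel column and row, a smaller z is treated as closer to the viewer, and the user input is modeled as a simple (dx, dy, dz) offset. The name `compose_augmented` and all parameters are hypothetical.

```python
import numpy as np

def compose_augmented(bg_img, bg_xyz, obj_img, obj_xyz, obj_mask, offset=(0.0, 0.0, 0.0)):
    """Place object pixels into the background according to 3D coordinates.

    Assumed conventions: x is the pixel column, y is the pixel row, and a
    smaller z is nearer to the viewer. offset models the user's input.
    """
    out = bg_img.copy()
    h, w = bg_img.shape[:2]
    moved = obj_xyz + np.asarray(offset)          # change the object's 3D coordinates
    rows, cols = np.nonzero(obj_mask)
    for r, c in zip(rows, cols):
        x, y, z = moved[r, c]
        u, v = int(round(x)), int(round(y))       # target column and row
        # Draw only inside the frame and only where the object is in front.
        if 0 <= v < h and 0 <= u < w and z < bg_xyz[v, u, 2]:
            out[v, u] = obj_img[r, c]
    return out

# 3 x 3 black background whose every pixel sits at depth z = 10.
grid_rows, grid_cols = np.mgrid[0:3, 0:3]
bg_img = np.zeros((3, 3, 3), dtype=np.uint8)
bg_xyz = np.stack([grid_cols, grid_rows, np.full((3, 3), 10.0)], axis=-1)

# One red object pixel at (x=0, y=0, z=5).
obj_img = np.zeros((3, 3, 3), dtype=np.uint8)
obj_img[0, 0] = (255, 0, 0)
obj_xyz = np.zeros((3, 3, 3))
obj_xyz[0, 0] = (0.0, 0.0, 5.0)
obj_mask = np.zeros((3, 3), dtype=bool)
obj_mask[0, 0] = True

moved_image = compose_augmented(bg_img, bg_xyz, obj_img, obj_xyz, obj_mask, offset=(1, 2, 0))
hidden_image = compose_augmented(bg_img, bg_xyz, obj_img, obj_xyz, obj_mask, offset=(0, 0, 10))
```

In `moved_image` the red pixel appears at column 1, row 2; in `hidden_image` the object has been pushed behind the background depth, so it is not drawn. A production system would likely use a proper camera projection rather than this direct coordinate-to-pixel rounding.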

Meanwhile, the method according to the embodiment of the present invention described above may be implemented in the form of a program readable by various computer means and recorded on a computer-readable recording medium. Here, the recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. For example, the recording medium includes magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Although the present invention has been described above using several preferred embodiments, these embodiments are illustrative and not restrictive. Accordingly, those of ordinary skill in the art to which the present invention pertains will understand that various changes and modifications can be made in accordance with the doctrine of equivalents without departing from the spirit of the present invention and the scope of rights set forth in the appended claims.

10: user device
20: video conference server
100: learning unit
200: image processing unit
300: coordinate generator
400: augmentation unit

Claims

1. A method for providing an augmented reality-based video conference, the method comprising:
generating, by a coordinate generator, a background coordinate vector expressed in three-dimensional coordinates from a background local image consisting only of a background in a local image, using a transformation model learned through deep learning, and generating an object coordinate vector expressed in three-dimensional coordinates from an object local image consisting only of an object in the local image, using the transformation model;
mapping, by an augmentation unit, the background coordinate vector to the background local image and the object coordinate vector to the object local image;
generating, by the augmentation unit, an augmented image by matching the object local image mapped to the object coordinate vector to the background local image mapped to the background coordinate vector according to the three-dimensional coordinates of the background coordinate vector and the object coordinate vector; and
providing, by the augmentation unit, the augmented image to a user device participating in the video conference.

2. The method of claim 1, wherein the generating of the augmented image comprises:
changing, by the augmentation unit, the three-dimensional coordinates of the object coordinate vector according to a received input when an input for manipulating the position of the object is received; and
generating, by the augmentation unit, the augmented image by matching the object local image to the background local image according to the changed three-dimensional coordinates of the object coordinate vector.

3. The method of claim 1, further comprising, before the generating of the object coordinate vector:
generating, by an image processing unit, the background local image and the object local image by separating a background and an object from a received image when at least one image is received from at least one user device.

4. The method of claim 1, wherein the generating of the object coordinate vector comprises:
generating, by the transformation model, the background coordinate vector by performing operations to which weights between a plurality of layers are applied when the coordinate generator inputs the background local image to the transformation model; and
generating, by the transformation model, the object coordinate vector by performing operations to which weights between a plurality of layers are applied when the coordinate generator inputs the object local image to the transformation model.

5. The method of claim 1, further comprising, before the generating of the background coordinate vector:
providing, by a learning unit, a plurality of learning data, each including a local image for learning and a measured coordinate vector consisting of three-dimensional coordinates measured for each pixel of the local image for learning; and
generating, by the learning unit, a transformation model including an identification network and a transformation network by using at least a part of the plurality of learning data to alternately perform
a first step of optimization that modifies the parameters of the identification network so that the identification network determines the measured coordinate vector to be an actual value and determines a learning coordinate vector generated by the transformation network to be a simulated value, and
a second step of optimization that modifies the parameters of the transformation network so that the identification network determines the learning coordinate vector generated by the transformation network to be an actual value.

6. The method of claim 5, wherein the first step includes
performing, by the learning unit, optimization that modifies the weights of the identification network, without modifying the weights of the transformation network, according to the identification loss calculated by the identification loss function Lds(x),
wherein Lds(x) is the identification loss function,
GT is the measured coordinate vector,
x is an input to the identification network and is either a learning coordinate vector or a measured coordinate vector, and
D(x) is an identification value that is the result of the identification network performing, on x, a plurality of operations to which a plurality of inter-layer weights are applied.

7. The method of claim 5, wherein the second step includes
performing, by the learning unit, optimization that modifies the weights of the transformation network, without modifying the weights of the identification network, so that the conversion loss calculated by the conversion loss function Ltn(z) is maximized,
wherein Ltn(z) is the conversion loss function,
z is an input to the transformation network and is a local image for learning,
G(z) is a learning coordinate vector calculated by the transformation network through a plurality of operations in which a plurality of inter-layer weights are applied to the local image for learning, and
D(G(z)) is an identification value that is the result of the identification network performing, on the input G(z), a plurality of operations to which a plurality of inter-layer weights are applied.

8. An apparatus for providing an augmented reality-based video conference, the apparatus comprising:
a coordinate generator configured to generate a background coordinate vector expressed in three-dimensional coordinates from a background local image consisting only of a background in a local image, using a transformation model learned through deep learning, and to generate an object coordinate vector expressed in three-dimensional coordinates from an object local image consisting only of an object in the local image, using the transformation model; and
an augmentation unit configured to map the background coordinate vector to the background local image, map the object coordinate vector to the object local image, generate an augmented image by matching the object local image mapped to the object coordinate vector to the background local image mapped to the background coordinate vector according to the three-dimensional coordinates of the background coordinate vector and the object coordinate vector, and provide the augmented image to a user device participating in a video conference.